class documentation

Instances clean the document of each of the possible offending elements. The cleaning is controlled by attributes; you can override attributes in a subclass, or set them in the constructor.

``scripts``:
    Removes any ``<script>`` tags.

``javascript``:
    Removes any JavaScript, like an ``onclick`` attribute. Also removes stylesheets, as they could contain JavaScript.

``comments``:
    Removes any comments.

``style``:
    Removes any style tags.

``inline_style``:
    Removes any style attributes. Defaults to the value of the ``style`` option.

``links``:
    Removes any ``<link>`` tags.

``meta``:
    Removes any ``<meta>`` tags.

``page_structure``:
    Removes the structural parts of a page: ``<head>``, ``<html>``, ``<title>``.

``processing_instructions``:
    Removes any processing instructions.

``embedded``:
    Removes any embedded objects (flash, iframes).

``frames``:
    Removes any frame-related tags.

``forms``:
    Removes any form tags.

``annoying_tags``:
    Removes tags that aren't *wrong*, but are annoying: ``<blink>`` and ``<marquee>``.

``remove_tags``:
    A list of tags to remove. Only the tags themselves are removed; their content is pulled up into the parent tag.

``kill_tags``:
    A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.

``allow_tags``:
    A list of tags to include (by default, all tags are included).

``remove_unknown_tags``:
    Removes any tags that aren't standard parts of HTML.

``safe_attrs_only``:
    If true, only include 'safe' attributes (specifically the list from the feedparser HTML sanitisation web site).

``safe_attrs``:
    A set of attribute names that overrides the default list of attributes considered 'safe' (used when ``safe_attrs_only=True``).

``add_nofollow``:
    If true, any ``<a>`` tags will have ``rel="nofollow"`` added to them.

``host_whitelist``:
    A list or set of hosts that you can use for embedded content (for content like ``<object>``, ``<link rel="stylesheet">``, etc). You can also implement/override the method ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to implement more complex rules for what can be embedded. Anything that passes this test will be shown, regardless of the value of (for instance) ``embedded``.

    Note that this parameter might not work as intended if you do not make the links absolute before doing the cleaning.

    Note that you may also need to set ``whitelist_tags``.

``whitelist_tags``:
    A set of tags that can be included with ``host_whitelist``. The default is ``iframe`` and ``embed``; you may wish to include other tags like ``script``, or you may want to implement ``allow_embedded_url`` for more control. Set to ``None`` to include all tags.

This modifies the document *in place*.
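As a minimal sketch of how the attributes described above can be set through the constructor: the import path is an assumption (recent lxml releases ship the cleaner as the separate ``lxml_html_clean`` package, older ones as ``lxml.html.clean``), and the sample markup is made up for illustration.

```python
# Assumption: one of these two import paths is available in your environment.
try:
    from lxml_html_clean import Cleaner
except ImportError:
    from lxml.html.clean import Cleaner

# Override a few defaults via keyword arguments.
cleaner = Cleaner(
    style=True,          # also drop <style> tags (and, by default, style attributes)
    kill_tags=["nav"],   # drop <nav> together with its whole subtree
    add_nofollow=True,   # add rel="nofollow" to links
)

# Hypothetical input, chosen to exercise several options at once.
dirty = (
    "<div>"
    "<script>alert(1)</script>"
    "<style>p {color: red}</style>"
    "<nav>menu</nav>"
    '<p onclick="evil()">Hello <a href="http://example.com/">link</a></p>'
    "</div>"
)

# clean_html() accepts a string and returns a cleaned string.
cleaned = cleaner.clean_html(dirty)
print(cleaned)
```

The same settings could instead be declared as class attributes on a subclass, which is convenient when one configuration is reused in many places.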

Method __call__ Cleans the document.
Method __init__ Undocumented
Method allow_element Decide whether an element is configured to be accepted or rejected.
Method allow_embedded_url Decide whether a URL that was found in an element's attributes or text is configured to be accepted or rejected.
Method allow_follow Override to suppress rel="nofollow" on some anchors.
Method clean_html Undocumented
Method kill_conditional_comments IE conditional comments basically embed HTML that the parser doesn't normally see. We can't allow anything like that, so we'll kill any comments that could be conditional.
Class Variable add_nofollow Undocumented
Class Variable allow_tags Undocumented
Class Variable annoying_tags Undocumented
Class Variable comments Undocumented
Class Variable embedded Undocumented
Class Variable forms Undocumented
Class Variable frames Undocumented
Class Variable host_whitelist Undocumented
Class Variable javascript Undocumented
Class Variable kill_tags Undocumented
Class Variable links Undocumented
Class Variable meta Undocumented
Class Variable page_structure Undocumented
Class Variable processing_instructions Undocumented
Class Variable remove_tags Undocumented
Class Variable safe_attrs_only Undocumented
Class Variable scripts Undocumented
Class Variable style Undocumented
Class Variable whitelist_tags Undocumented
Instance Variable inline_style Undocumented
Instance Variable remove_unknown_tags Undocumented
Method _has_sneaky_javascript Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``. This checks for attempts to do stuff like this.
Method _kill_elements Undocumented
Method _remove_javascript_link Undocumented
Class Variable _substitute_comments Undocumented
Class Variable _tag_link_attrs Undocumented
def __call__(self, doc): (source)

Cleans the document.

def __init__(self, **kw): (source)

Undocumented

def allow_element(self, el): (source)

Decide whether an element is configured to be accepted or rejected.

:param el: an element.
:return: true to accept the element or false to reject/discard it.

def allow_embedded_url(self, el, url): (source)

Decide whether a URL that was found in an element's attributes or text is configured to be accepted or rejected.

:param el: an element.
:param url: a URL found on the element.
:return: true to accept the URL and false to reject it.
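The following sketch shows one way ``allow_embedded_url`` might be overridden to accept iframes from any subdomain of a trusted domain. The class name ``SubdomainCleaner``, the ``trusted_suffix`` attribute, and the hosts are hypothetical, and the import path is an assumption that depends on the installed lxml version.

```python
from urllib.parse import urlsplit

# Assumption: one of these two import paths is available in your environment.
try:
    from lxml_html_clean import Cleaner
except ImportError:
    from lxml.html.clean import Cleaner


class SubdomainCleaner(Cleaner):
    """Hypothetical subclass: keep HTTPS iframes hosted under one domain."""

    trusted_suffix = ".example.com"  # illustrative value, not an lxml setting

    def allow_embedded_url(self, el, url):
        # Only iframes are considered; everything else is rejected.
        if el.tag != "iframe":
            return False
        parts = urlsplit(url)
        host = parts.hostname or ""
        # The leading dot prevents matches such as "evilexample.com".
        return parts.scheme == "https" and host.endswith(self.trusted_suffix)


dirty = (
    "<div>"
    '<iframe src="https://video.example.com/embed/1"></iframe>'
    '<iframe src="https://evil.test/x"></iframe>'
    "</div>"
)
cleaned = SubdomainCleaner().clean_html(dirty)
print(cleaned)
```

For a fixed list of hosts, setting ``host_whitelist`` is simpler; an override like this is only worth it when the rule cannot be expressed as a plain set of host names.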

def allow_follow(self, anchor): (source)

Override to suppress rel="nofollow" on some anchors.
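A small sketch of such an override, assuming you want links to your own site to stay followable while all other links receive ``rel="nofollow"``. The class name, the ``own_host`` attribute, and the URLs are hypothetical; the import path depends on the installed lxml version.

```python
from urllib.parse import urlsplit

# Assumption: one of these two import paths is available in your environment.
try:
    from lxml_html_clean import Cleaner
except ImportError:
    from lxml.html.clean import Cleaner


class InternalFollowCleaner(Cleaner):
    """Hypothetical subclass: skip nofollow for links to our own host."""

    add_nofollow = True        # enable the nofollow pass via a class attribute
    own_host = "example.com"   # illustrative value, not an lxml setting

    def allow_follow(self, anchor):
        # Returning True suppresses rel="nofollow" for this anchor.
        return urlsplit(anchor.get("href", "")).hostname == self.own_host


dirty = (
    '<p><a href="http://example.com/a">internal</a> '
    '<a href="http://other.test/b">external</a></p>'
)
cleaned = InternalFollowCleaner().clean_html(dirty)
print(cleaned)
```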

def clean_html(self, html): (source)

Undocumented
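The sketch below contrasts ``clean_html`` with calling the cleaner directly. It relies on behaviour observed in the lxml implementation rather than on this page: ``clean_html`` works on a copy and returns a result of the same type as its input, whereas ``__call__`` modifies the tree it is given. Treat the import path as an assumption that depends on the installed lxml version.

```python
import lxml.html

# Assumption: one of these two import paths is available in your environment.
try:
    from lxml_html_clean import Cleaner
except ImportError:
    from lxml.html.clean import Cleaner

cleaner = Cleaner()
tree = lxml.html.fromstring("<div><script>x()</script><p>kept</p></div>")

# clean_html() returns a cleaned copy and leaves the input tree alone.
cleaned_copy = cleaner.clean_html(tree)
original_still_has_script = tree.find(".//script") is not None

# Calling the cleaner itself modifies the tree in place.
cleaner(tree)
script_after_call = tree.find(".//script")

# String in, string out.
cleaned_text = cleaner.clean_html("<p>plain <b>text</b></p>")
```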

def kill_conditional_comments(self, doc): (source)

IE conditional comments basically embed HTML that the parser doesn't normally see. We can't allow anything like that, so we'll kill any comments that could be conditional.
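To illustrate, the sketch below calls ``kill_conditional_comments`` directly on a parsed fragment that contains one conditional comment and one ordinary comment. The markup is made up, and the import path is an assumption that depends on the installed lxml version.

```python
import lxml.html

# Assumption: one of these two import paths is available in your environment.
try:
    from lxml_html_clean import Cleaner
except ImportError:
    from lxml.html.clean import Cleaner

doc = lxml.html.fromstring(
    "<div>"
    "<!--[if IE]><script>alert(1)</script><![endif]-->"
    "<!-- plain note -->"
    "<p>text</p>"
    "</div>"
)

# Only comments that look conditional are dropped; the tree is changed in place.
Cleaner().kill_conditional_comments(doc)
result = lxml.html.tostring(doc, encoding="unicode")
print(result)
```

During a normal cleaning run this step only matters when ``comments`` is false, since otherwise every comment is removed anyway.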

add_nofollow: bool = (source)

Undocumented

allow_tags = (source)

Undocumented

annoying_tags: bool = (source)

Undocumented

comments: bool = (source)

Undocumented

embedded: bool = (source)

Undocumented

forms: bool = (source)

Undocumented

frames: bool = (source)

Undocumented

host_whitelist: tuple = (source)

Undocumented

javascript: bool = (source)

Undocumented

kill_tags = (source)

Undocumented

links: bool = (source)

Undocumented

meta: bool = (source)

Undocumented

page_structure: bool = (source)

Undocumented

processing_instructions: bool = (source)

Undocumented

remove_tags = (source)

Undocumented

safe_attrs_only: bool = (source)

Undocumented

scripts: bool = (source)

Undocumented

style: bool = (source)

Undocumented

whitelist_tags: set[str] = (source)

Undocumented

inline_style = (source)

Undocumented

remove_unknown_tags: bool = (source)

Undocumented

def _has_sneaky_javascript(self, style): (source)

Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``. This checks for attempts to do stuff like this. Typically the response will be to kill the entire style. If you have just a bit of JavaScript in the style, another rule will catch that and remove only the JavaScript from the style; this check catches the sneakier attempts.
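The idea can be sketched independently of the library: normalize the style text so that the obfuscation disappears, then look for the suspicious token. This is a simplified illustration using only the standard library, not the method's actual code; the function name ``looks_like_sneaky_js`` and the two patterns are assumptions made for the example.

```python
import re

# Hypothetical helpers for the sketch (not part of the Cleaner API).
_CSS_COMMENT = re.compile(r"/\*.*?\*/", re.S)
_WHITESPACE = re.compile(r"\s+")


def looks_like_sneaky_js(style: str) -> bool:
    """Return True if the style text hides an expression() or javascript: URL."""
    normalized = _CSS_COMMENT.sub("", style)           # expre/* x */ssion -> expression
    normalized = normalized.replace("\\", "")          # drop CSS backslash escapes
    normalized = _WHITESPACE.sub("", normalized)       # e x p r ... -> expr...
    normalized = normalized.lower()
    return "expression(" in normalized or "javascript:" in normalized


spaced = looks_like_sneaky_js("width: e x p r e s s i o n(alert(1))")
commented = looks_like_sneaky_js("width: expre/* stuff */ssion(alert(1))")
harmless = looks_like_sneaky_js("color: red; margin: 0")
```

A real implementation would need to cover more cases than these two patterns; the sketch only shows why normalizing before matching defeats the spacing and comment tricks mentioned above.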

def _kill_elements(self, doc, condition, iterate=None): (source)

Undocumented

def _remove_javascript_link(self, link): (source)

Undocumented

_substitute_comments = (source)

Undocumented

_tag_link_attrs = (source)

Undocumented