module documentation

An interface to html5lib that mimics the lxml.html interface.

Class HTMLParser An html5lib HTML parser with lxml as tree.
Function document_fromstring Parse a whole document from a string.
Function fragment_fromstring Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.
Function fragments_fromstring Parses several HTML elements, returning a list of elements.
Function fromstring Parse the HTML, returning a single element/document.
Function parse Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use ``parse(...).getroot()`` to get the document root.
Variable html_parser Undocumented
Function _find_tag Undocumented
Function _looks_like_url Undocumented
def document_fromstring(html, guess_charset=None, parser=None):

Parse a whole document from a string. If `guess_charset` is true, or if the input is a byte string rather than Unicode, the `chardet` library will perform charset guessing on the string.
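A minimal sketch of calling `document_fromstring`, assuming lxml and the optional html5lib package are installed. The helpers `local_name` and `parse_document` are hypothetical names used only for illustration. The sketch assumes the call returns the document's root element and that tags may carry a namespace prefix, which is why the tag is stripped before printing.

```python
try:
    # lxml.html.html5parser depends on the optional html5lib package.
    from lxml.html import html5parser
except ImportError:
    html5parser = None


def local_name(element):
    """Strip any '{namespace}' prefix from an element's tag (hypothetical helper)."""
    return element.tag.rsplit("}", 1)[-1]


def parse_document(markup):
    """Parse a complete document from a string and return what the parser hands back."""
    return html5parser.document_fromstring(markup)


if html5parser is not None:
    root = parse_document("<title>Demo</title><p>Hello</p>")
    print(local_name(root))          # local tag name of the returned element
    print("".join(root.itertext()))  # all text found in the document
```

Passing a Unicode string, as here, sidesteps charset guessing entirely; byte strings are where `guess_charset` comes into play.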

def fragment_fromstring(html, create_parent=False, guess_charset=None, parser=None):

Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element. If `create_parent` is true (or is a tag name), a parent node is created to wrap the HTML in a single element; in that case, leading or trailing text is allowed. If `guess_charset` is true, the `chardet` library will perform charset guessing on the string.
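The rules above can be sketched as follows, assuming lxml and html5lib are installed. `parse_single` is a hypothetical wrapper, and the sketch assumes the error is raised as `lxml.etree.ParserError`.

```python
try:
    from lxml import etree
    # lxml.html.html5parser depends on the optional html5lib package.
    from lxml.html import html5parser
except ImportError:
    etree = html5parser = None


def parse_single(markup, create_parent=False):
    """Return the parsed element, or None if the markup is rejected.

    Hypothetical wrapper; assumes failures surface as etree.ParserError.
    """
    try:
        return html5parser.fragment_fromstring(markup, create_parent=create_parent)
    except etree.ParserError:
        return None


if html5parser is not None:
    # Surrounding whitespace is tolerated around a single element.
    single = parse_single("  <p>only one</p>  ")
    # Two sibling elements are an error without a parent.
    rejected = parse_single("<p>one</p><p>two</p>")
    # With a tag name as create_parent, text around the element is allowed.
    wrapped = parse_single("lead <b>bold</b> tail", create_parent="section")
    print(single is not None, rejected is None, wrapped.text)
```

Catching the error and returning `None` is only one possible design; letting the exception propagate is often clearer when malformed input should stop processing.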

def fragments_fromstring(html, no_leading_text=False, guess_charset=None, parser=None):

Parses several HTML elements, returning a list of elements. The first item in the list may be a string. If `no_leading_text` is true, leading text is an error and the result is always a list containing only elements. If `guess_charset` is true, the `chardet` library will perform charset guessing on the string.
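Because the first list item may be a string, callers usually need to separate it from the elements. A minimal sketch, assuming lxml and html5lib are installed; `split_fragments` is a hypothetical helper name.

```python
try:
    # lxml.html.html5parser depends on the optional html5lib package.
    from lxml.html import html5parser
except ImportError:
    html5parser = None


def split_fragments(markup):
    """Return (leading_text, elements) for a run of sibling elements.

    Hypothetical helper: peels off the optional leading string so the
    caller always receives a list that contains only elements.
    """
    parts = list(html5parser.fragments_fromstring(markup))
    leading = ""
    if parts and isinstance(parts[0], str):
        leading = parts.pop(0)
    return leading, parts


if html5parser is not None:
    leading, elements = split_fragments("intro <b>bold</b> and <i>italic</i>")
    print(repr(leading))
    for element in elements:
        # Text that follows an element is stored on its .tail attribute.
        print(element.tag.rsplit("}", 1)[-1], repr(element.tail))
```

Passing `no_leading_text=True` instead shifts this check into the parser, which then treats leading text as an error.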

def fromstring(html, guess_charset=None, parser=None):

Parse the HTML, returning a single element/document. This tries to minimally parse the chunk of text, without knowing whether it is a fragment or a document. `base_url` will set the document's base_url attribute (and the tree's docinfo.URL). If `guess_charset` is true, or if the input is a byte string rather than Unicode, the `chardet` library will perform charset guessing on the string.
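To illustrate the "element or document" behaviour, the sketch below feeds `fromstring` a lone element and a full document and reports what comes back. It assumes lxml and html5lib are installed; `describe` is a hypothetical helper, and the namespace stripping is an assumption about how tags are reported.

```python
try:
    # lxml.html.html5parser depends on the optional html5lib package.
    from lxml.html import html5parser
except ImportError:
    html5parser = None


def describe(markup):
    """Return the local tag name of whatever fromstring() hands back (hypothetical helper)."""
    result = html5parser.fromstring(markup)
    return result.tag.rsplit("}", 1)[-1]


if html5parser is not None:
    # A lone element: the element itself is expected back.
    print(describe("<p>just one paragraph</p>"))
    # Markup that starts with a doctype: the whole document is expected back.
    print(describe(
        "<!DOCTYPE html><html><head><title>t</title></head>"
        "<body><p>x</p></body></html>"
    ))
```

When the input is known to be a fragment or a document ahead of time, the more specific functions avoid this guesswork.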

def parse(filename_url_or_file, guess_charset=None, parser=None):

Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use ``parse(...).getroot()`` to get the document root. If ``guess_charset`` is true, the ``useChardet`` option is passed into html5lib to enable character detection. This option is on by default when parsing from URLs, off by default when parsing from file(-like) objects (which tend to return Unicode more often than not), and on by default when parsing from a file path (which is read in binary mode).
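A minimal sketch of the tree-versus-element distinction, assuming lxml and html5lib are installed. It uses an in-memory `io.StringIO` object as the file-like input so that no file or network access is needed; `root_from_file` is a hypothetical helper name.

```python
import io

try:
    # lxml.html.html5parser depends on the optional html5lib package.
    from lxml.html import html5parser
except ImportError:
    html5parser = None


def root_from_file(fileobj):
    """Parse a file-like object and return the document root (hypothetical helper)."""
    tree = html5parser.parse(fileobj)  # a tree, not an element
    return tree.getroot()              # the document root element


if html5parser is not None:
    source = io.StringIO("<html><body><p>from a file-like object</p></body></html>")
    root = root_from_file(source)
    print(root.tag.rsplit("}", 1)[-1])
    print("".join(root.itertext()))
```

The file-like object here yields Unicode text, which matches the case where character detection is off by default.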

html_parser =

Undocumented

def _find_tag(tree, tag):

Undocumented

def _looks_like_url(str):

Undocumented