lxml.html.html5parser

module documentation

An interface to html5lib that mimics the lxml.html interface.

Class	`HTMLParser`	An html5lib HTML parser with lxml as tree.
Function	`document_fromstring`	Parse a whole document from a string.
Function	`fragment_fromstring`	Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.
Function	`fragments_fromstring`	Parses several HTML elements, returning a list of elements.
Function	`fromstring`	Parse the html, returning a single element/document.
Function	`parse`	Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use `parse(...).getroot()` to get the document root.
Variable	`html_parser`	Undocumented
Function	`_find_tag`	Undocumented
Function	`_looks_like_url`	Undocumented

def document_fromstring(html, guess_charset=None, parser=None):

Parse a whole document from a string and return its root element. If `guess_charset` is true, or if the input is a byte string rather than Unicode, the `chardet` library will perform charset guessing on the string.
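As a minimal sketch of the call described above, assuming both lxml and html5lib are installed, the snippet below parses a Unicode string and inspects the result. The sample markup and the variable names are illustrative only. html5lib normally places elements in the XHTML namespace, so the sketch compares local names instead of raw tags.

```python
from lxml import etree
from lxml.html import html5parser  # needs the html5lib package at import time

# A str input with guess_charset left at None, so no charset guessing is requested.
doc = html5parser.document_fromstring("<title>Hi</title><p>Hello</p>")

# Compare local names, since the tags may carry the XHTML namespace.
root_name = etree.QName(doc).localname
names = [etree.QName(el).localname for el in doc.iter() if isinstance(el.tag, str)]

print(root_name)
print(names)
```

Even though the input has no explicit `<html>`, `<head>`, or `<body>` tags, the HTML5 parsing rules supply them, so the returned element is the document root.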

def fragment_fromstring(html, create_parent=False, guess_charset=None, parser=None):

Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element. If `create_parent` is true (or is a tag name), a parent node is created to wrap the HTML in a single element. In this case, leading or trailing text is allowed. If `guess_charset` is true, the `chardet` library will perform charset guessing on the string.
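The three behaviours described above (one element, an error for several elements, and wrapping with `create_parent`) can be sketched as follows. This is an illustration under the assumption that lxml and html5lib are installed; the markup and the `section` parent tag are arbitrary choices, not taken from the documentation.

```python
from lxml import etree
from lxml.html import html5parser  # needs the html5lib package at import time

# Exactly one element: returned directly.
single = html5parser.fragment_fromstring("<p>one</p>")
single_name = etree.QName(single).localname

# Two sibling elements without create_parent: expected to be rejected.
error = None
try:
    html5parser.fragment_fromstring("<p>one</p><p>two</p>")
except etree.ParserError as exc:
    error = exc

# With create_parent given as a tag name, leading and trailing text is allowed
# and everything is wrapped in a new parent element.
wrapped = html5parser.fragment_fromstring("lead <b>x</b> tail", create_parent="section")

print(single_name, type(error).__name__, wrapped.tag, repr(wrapped.text))
```

Passing `create_parent=True` instead of a tag name is also allowed by the description above; which tag is then used for the parent is not stated in this page.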

def fragments_fromstring(html, no_leading_text=False, guess_charset=None, parser=None):

Parses several HTML elements, returning a list of elements. The first item in the list may be a string. If `no_leading_text` is true, leading text is an error and the result is always a list containing only elements. If `guess_charset` is true, the `chardet` library will perform charset guessing on the string.
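To illustrate the leading-text rule, here is a small sketch, again assuming lxml and html5lib are available. The input string is made up for the example.

```python
from lxml import etree
from lxml.html import html5parser  # needs the html5lib package at import time

markup = "intro <b>bold</b><i>italic</i>"

# Default: leading text is kept as the first list item (a plain string).
parts = html5parser.fragments_fromstring(markup)
leading = parts[0]
element_names = [etree.QName(el).localname for el in parts[1:]]

# With no_leading_text=True, the same input is expected to raise an error.
error = None
try:
    html5parser.fragments_fromstring(markup, no_leading_text=True)
except etree.ParserError as exc:
    error = exc

print(repr(leading), element_names, type(error).__name__)
```

Callers that iterate over the result should therefore check whether the first item is a string before treating it as an element.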

def fromstring(html, guess_charset=None, parser=None):

Parse the html, returning a single element/document. This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document. The docstring also says that 'base_url' will set the document's base_url attribute (and the tree's docinfo.URL), but the signature shown here has no `base_url` parameter. If `guess_charset` is true, or if the input is a byte string rather than Unicode, the `chardet` library will perform charset guessing on the string.
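The "minimal parse" behaviour can be seen by feeding the function a lone element and a full document. The sketch below assumes lxml and html5lib are installed and uses invented sample markup; it only covers these two simple cases.

```python
from lxml import etree
from lxml.html import html5parser  # needs the html5lib package at import time

# A lone element: the element itself comes back, not a whole document.
single = html5parser.fromstring("<p>just one element</p>")
single_name = etree.QName(single).localname

# Input that starts with a doctype is treated as a full document.
full = html5parser.fromstring(
    "<!DOCTYPE html><html><head><title>t</title></head>"
    "<body><p>x</p></body></html>"
)
full_name = etree.QName(full).localname

print(single_name, full_name)
```

When the kind of input is known in advance, `document_fromstring` or `fragment_fromstring` states the intent more directly than relying on this detection.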

def parse(filename_url_or_file, guess_charset=None, parser=None):

Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use `parse(...).getroot()` to get the document root. If `guess_charset` is true, the `useChardet` option is passed into html5lib to enable character detection. When `guess_charset` is not given, detection is on when parsing from a URL, on when parsing from a file path (which is read in binary mode), and off when parsing from a file(-like) object (which tends to return Unicode more often than not).
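A short sketch of the tree-versus-element distinction, assuming lxml and html5lib are installed. It uses an in-memory `io.StringIO` object as the file-like input so that it runs without touching the filesystem or the network; the markup is illustrative.

```python
import io

from lxml import etree
from lxml.html import html5parser  # needs the html5lib package at import time

# A file-like object that returns Unicode text.
source = io.StringIO("<!DOCTYPE html><title>T</title><p>body text</p>")

tree = html5parser.parse(source)  # a document tree, not an element
root = tree.getroot()             # the document root element
root_name = etree.QName(root).localname

print(type(tree).__name__, root_name)
```

A file path or URL string can be passed in the same position; in those cases the input is read as bytes, which is why charset detection defaults to on.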

html_parser

Undocumented

def _find_tag(tree, tag):

Undocumented

def _looks_like_url(str):

Undocumented