An interface to html5lib that mimics the lxml.html interface.
Class |
|
An html5lib HTML parser with lxml as tree. |
Function | document |
Parse a whole document into a string. |
Function | fragment |
Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element. |
Function | fragments |
Parses several HTML elements, returning a list of elements. |
Function | fromstring |
Parse the html, returning a single element/document. |
Function | parse |
Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use ``parse(...).getroot()`` to get the document root. |
Variable | html |
Undocumented |
Function | _find |
Undocumented |
Function | _looks |
Undocumented |
Parse a whole document into a string. If `guess_charset` is true, or if the input is not Unicode but a byte string, the `chardet` library will perform charset guessing on the string.
Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element. If 'create_parent' is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element. In this case, leading or trailing text is allowed. If `guess_charset` is true, the `chardet` library will perform charset guessing on the string.
Parses several HTML elements, returning a list of elements. The first item in the list may be a string. If no_leading_text is true, then it will be an error if there is leading text, and it will always be a list of only elements. If `guess_charset` is true, the `chardet` library will perform charset guessing on the string.
Parse the html, returning a single element/document. This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document. 'base_url' will set the document's base_url attribute (and the tree's docinfo.URL) If `guess_charset` is true, or if the input is not Unicode but a byte string, the `chardet` library will perform charset guessing on the string.
Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use ``parse(...).getroot()`` to get the document root. If ``guess_charset`` is true, the ``useChardet`` option is passed into html5lib to enable character detection. This option is on by default when parsing from URLs, off by default when parsing from file(-like) objects (which tend to return Unicode more often than not), and on by default when parsing from a file path (which is read in binary mode).