package documentation

The ``lxml.html`` tool set for HTML handling.

Module builder A set of HTML generator tags for building HTML documents.
Module clean A cleanup tool for HTML.
Module defs Data taken from https://www.w3.org/TR/html401/index/elements.html and https://www.w3.org/community/webed/wiki/HTML/New_HTML5_Elements for html5_tags.
Module diff No module docstring; 0/9 variable, 32/36 functions, 1/1 exception, 4/6 classes documented
Module ElementSoup Undocumented
Module formfill No module docstring; 0/5 variable, 0/15 function, 1/1 exception, 0/1 class documented
Module html5parser An interface to html5lib that mimics the lxml.html interface.
Module soupparser External interface to the BeautifulSoup HTML parser.
Package tests No package docstring; 1/14 module documented
Module usedoctest Doctest module for HTML comparison.
Module _diffcommand Undocumented
Module _html5builder Legacy module - don't use in new code!
Module _setmixin No module docstring; 1/1 class documented

From __init__.py:

Class CheckboxGroup Represents a group of checkboxes (``<input type=checkbox>``) that have the same name.
Class CheckboxValues Represents the values of the checked checkboxes in a group of checkboxes with the same name.
Class Classes Provides access to an element's class attribute as a set-like collection.
Class FieldsDict Undocumented
Class FormElement Represents a <form> element.
Class HtmlComment Undocumented
Class HtmlElement Undocumented
Class HtmlElementClassLookup A lookup scheme for HTML Element classes.
Class HtmlEntity Undocumented
Class HtmlMixin No class docstring; 6/6 properties, 12/15 methods documented
Class HTMLParser An HTML parser that is configured to return lxml.html Element objects.
Class HtmlProcessingInstruction Undocumented
Class InputElement Represents an ``<input>`` element.
Class InputGetter An accessor that represents all the input fields in a form.
Class InputMixin Mix-in for all input elements (input, select, and textarea)
Class LabelElement Represents a ``<label>`` element.
Class MultipleSelectOptions Represents all the selected options in a ``<select multiple>`` element.
Class RadioGroup This object represents several ``<input type=radio>`` elements that have the same name.
Class SelectElement ``<select>`` element. You can get the name with ``.name``.
Class TextareaElement ``<textarea>`` element. You can get the name with ``.name`` and get/set the value with ``.value``
Class XHTMLParser An XML parser that is configured to return lxml.html Element objects.
Function document_fromstring Undocumented
Function Element Create a new HTML Element.
Function fragment_fromstring Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.
Function fragments_fromstring Parses several HTML elements, returning a list of elements.
Function fromstring Parse the html, returning a single element/document.
Function html_to_xhtml Convert all tags in an HTML tree to XHTML by moving them to the XHTML namespace.
Function open_http_urllib Undocumented
Function open_in_browser Open the HTML document in a web browser, saving it to a temporary file to open it. Note that this does not delete the file after use. This is mainly meant for debugging.
Function parse Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use ``parse(...).getroot()`` to get the document root.
Function submit_form Helper function to submit a form. Returns a file-like object, as from ``urllib.request.urlopen()``. This object also has a ``.geturl()`` function, which shows the URL if there were any redirects.
Function tostring Return an HTML string representation of the document.
Function xhtml_to_html Convert all tags in an XHTML tree to HTML by removing their XHTML namespace.
Constant XHTML_NAMESPACE Undocumented
Variable find_class Undocumented
Variable find_rel_links Undocumented
Variable html_parser Undocumented
Variable iterlinks Undocumented
Variable make_links_absolute Undocumented
Variable resolve_base_href Undocumented
Variable rewrite_links Undocumented
Variable xhtml_parser Undocumented
Class _MethodFunc An object that represents a method on an element as a function; the function takes either an element or an HTML string. It returns whatever the function normally returns, or if the function works in-place (and so returns None) it returns a serialized form of the resulting document.
Function __fix_docstring Undocumented
Function _contains_block_level_tag Undocumented
Function _element_name Undocumented
Function _nons Undocumented
Function _transform_result Convert the result back into the input type.
Function _unquote_match Undocumented
Variable __bytes_replace_meta_content_type Undocumented
Variable __str_replace_meta_content_type Undocumented
Variable _archive_re Undocumented
Variable _class_xpath Undocumented
Variable _collect_string_content Undocumented
Variable _forms_xpath Undocumented
Variable _id_xpath Undocumented
Variable _iter_css_imports Undocumented
Variable _iter_css_urls Undocumented
Variable _label_xpath Undocumented
Variable _looks_like_full_html_bytes Undocumented
Variable _looks_like_full_html_unicode Undocumented
Variable _options_xpath Undocumented
Variable _parse_meta_refresh_url Undocumented
Variable _rel_links_xpath Undocumented
def __fix_docstring(s): (source)

Undocumented

XHTML_NAMESPACE: str = (source)

Undocumented

Value
'http://www.w3.org/1999/xhtml'
_rel_links_xpath = (source)

Undocumented

_options_xpath = (source)

Undocumented

_forms_xpath = (source)

Undocumented

_class_xpath = (source)

Undocumented

_id_xpath = (source)

Undocumented

_collect_string_content = (source)

Undocumented

_iter_css_urls = (source)

Undocumented

_iter_css_imports = (source)

Undocumented

_label_xpath = (source)

Undocumented

_archive_re = (source)

Undocumented

_parse_meta_refresh_url = (source)

Undocumented

def _unquote_match(s, pos): (source)

Undocumented

def _transform_result(typ, result): (source)

Convert the result back into the input type.

def _nons(tag): (source)

Undocumented

find_rel_links = (source)

Undocumented

find_class = (source)

Undocumented

make_links_absolute = (source)

Undocumented

resolve_base_href = (source)

Undocumented

iterlinks = (source)

Undocumented

rewrite_links = (source)

Undocumented
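
The module-level helpers listed above (``find_class``, ``make_links_absolute``, ``rewrite_links`` and so on) carry no docstring here. Going by the ``_MethodFunc`` description, they appear to be function forms of the element methods of the same name. The sketch below assumes that reading; the sample markup and URL are made up for illustration::

```python
from lxml import html

# Function form applied to a string: the underlying method works in place
# and returns None, so the helper hands back the serialized document.
snippet = '<a href="x.html">x</a>'
absolute = html.make_links_absolute(snippet, "http://example.com/")
print(absolute)

# Function form of a method that returns a value: the value is passed through.
matches = html.find_class('<div><p class="note">hi</p></div>', "note")
print([el.tag for el in matches])
```

Extra arguments are passed positionally here, which keeps them clearly separate from any parsing options.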

_looks_like_full_html_unicode = (source)

Undocumented

_looks_like_full_html_bytes = (source)

Undocumented

def document_fromstring(html, parser=None, ensure_head_body=False, **kw): (source)

Undocumented

def fragments_fromstring(html, no_leading_text=False, base_url=None, parser=None, **kw): (source)

Parses several HTML elements, returning a list of elements. The first item in the list may be a string (the leading text, if any). If ``no_leading_text`` is true, leading text is an error and the result is always a list containing only elements. Passing a ``base_url`` will set the document's ``base_url`` attribute (and the tree's docinfo.URL).
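
As a rough illustration of the leading-text behaviour described above, the following sketch parses a made-up fragment that starts with plain text and then rejects the same input with ``no_leading_text=True``::

```python
from lxml import etree, html

# Leading text comes back as a plain string in front of the elements.
parts = html.fragments_fromstring("leading text<p>one</p><p>two</p>")
leading = parts[0]
tags = [el.tag for el in parts[1:]]
print(repr(leading), tags)

# With no_leading_text=True the same input is rejected.
try:
    html.fragments_fromstring("leading text<p>one</p>", no_leading_text=True)
    rejected = False
except etree.ParserError:
    rejected = True
print(rejected)
```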

def fragment_fromstring(html, create_parent=False, base_url=None, parser=None, **kw): (source)

Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element. If ``create_parent`` is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element. In this case, leading or trailing text is also allowed, as are multiple elements as a result of parsing. Passing a ``base_url`` will set the document's ``base_url`` attribute (and the tree's docinfo.URL).
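
A minimal sketch of the two modes described above, using made-up markup: a strict single-element parse, a parse that fails because two elements are present, and the same kind of input accepted once ``create_parent`` supplies a wrapper::

```python
from lxml import etree, html

# Exactly one element: returned as-is.
single = html.fragment_fromstring("<p>Hello</p>")
print(single.tag)

# Two top-level elements without create_parent is an error.
try:
    html.fragment_fromstring("<p>one</p><p>two</p>")
    failed = False
except etree.ParserError:
    failed = True
print(failed)

# With create_parent, surrounding text and several elements are allowed;
# everything ends up inside the newly created parent element.
wrapped = html.fragment_fromstring("text <b>bold</b> more", create_parent="div")
print(wrapped.tag, [child.tag for child in wrapped])
```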

def fromstring(html, base_url=None, parser=None, **kw): (source)

Parse the HTML, returning a single element/document. This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document. Passing a ``base_url`` will set the document's ``base_url`` attribute (and the tree's docinfo.URL).
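
To make "minimally parse" more concrete, this sketch feeds ``fromstring`` a full document and a bare fragment. The inputs are illustrative only, and the exact wrapping applied to multi-element input is left out because it is not described here::

```python
from lxml import html

# A complete document comes back as the <html> root element.
full = html.fromstring(
    "<html><head><title>T</title></head><body><p>x</p></body></html>"
)
print(full.tag)

# A single-element fragment comes back as that element, not as a document.
frag = html.fromstring("<p>x</p>")
print(frag.tag)
```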

def parse(filename_or_url, parser=None, base_url=None, **kw): (source)

Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use ``parse(...).getroot()`` to get the document root. You can override the base URL with the ``base_url`` keyword. This is most useful when parsing from a file-like object.
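
The sketch below shows the tree/element distinction and the ``base_url`` override when parsing from a file-like object. The in-memory ``BytesIO`` source and the example.com URL are assumptions chosen so the snippet needs no network or file access::

```python
from io import BytesIO

from lxml import html

source = BytesIO(b'<html><body><a href="page.html">link</a></body></html>')

# parse() returns a tree; getroot() gives the <html> element.
tree = html.parse(source, base_url="http://example.com/dir/")
root = tree.getroot()
print(root.tag)

# The overridden base URL is available on the element and can be used
# to resolve relative links.
root.make_links_absolute(root.base_url)
hrefs = root.xpath("//a/@href")
print(hrefs)
```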

def _contains_block_level_tag(el): (source)

Undocumented

def _element_name(el): (source)

Undocumented

def submit_form(form, extra_values=None, open_http=None): (source)

Helper function to submit a form. Returns a file-like object, as from ``urllib.request.urlopen()``. This object also has a ``.geturl()`` function, which shows the URL if there were any redirects.

You can use this like::

    form = doc.forms[0]
    form.inputs['foo'].value = 'bar' # etc
    response = submit_form(form)
    doc = parse(response)
    doc.make_links_absolute(response.geturl())

To change the HTTP requester, pass a function as the ``open_http`` keyword argument that opens the URL for you. The function must have the following signature::

    open_http(method, URL, values)

The method is one of 'GET' or 'POST', the URL is the target URL as a string, and the values are a sequence of ``(name, value)`` tuples with the form data.
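
As a sketch of the ``open_http`` hook, the following replaces the HTTP requester with a hypothetical ``recording_open_http`` function that only records its arguments, so nothing is sent over the network. The form markup and URL are invented for the example::

```python
from lxml import html

doc = html.fromstring(
    '<html><body>'
    '<form action="http://example.com/search" method="get">'
    '<input name="q" value="">'
    '</form>'
    '</body></html>'
)
form = doc.forms[0]
form.inputs["q"].value = "lxml"

calls = []

def recording_open_http(method, url, values):
    """Hypothetical requester: record the request instead of sending it."""
    calls.append((method, url, list(values)))
    return None

html.submit_form(form, extra_values={"page": "1"}, open_http=recording_open_http)
print(calls)
```

A real replacement would return a file-like response object instead of ``None``, so that the result can be handed to ``parse()``.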

def open_http_urllib(method, url, values): (source)

Undocumented

def html_to_xhtml(html): (source)

Convert all tags in an HTML tree to XHTML by moving them to the XHTML namespace.

def xhtml_to_html(xhtml): (source)

Convert all tags in an XHTML tree to HTML by removing their XHTML namespace.
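
A short round trip through ``html_to_xhtml`` and ``xhtml_to_html`` may help here. The sketch assumes both functions modify the tree in place, which matches the "moving" and "removing" wording but is not stated outright::

```python
from lxml import html

doc = html.fromstring("<div><p>Hi</p></div>")

# Move every tag into the XHTML namespace.
html.html_to_xhtml(doc)
xhtml_tag = doc.tag
print(xhtml_tag)

# Strip the namespace again to get plain HTML tag names back.
html.xhtml_to_html(doc)
plain_tags = [el.tag for el in doc.iter()]
print(plain_tags)
```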

__str_replace_meta_content_type = (source)

Undocumented

__bytes_replace_meta_content_type = (source)

Undocumented

def tostring(doc, pretty_print=False, include_meta_content_type=False, encoding=None, method='html', with_tail=True, doctype=None): (source)

Return an HTML string representation of the document.

Note: if include_meta_content_type is true this will create a ``<meta http-equiv="Content-Type" ...>`` tag in the head; regardless of the value of include_meta_content_type any existing ``<meta http-equiv="Content-Type" ...>`` tag will be removed.

The ``encoding`` argument controls the output encoding (defaults to ASCII, with &#...; character references for any characters outside of ASCII). Note that you can pass the name ``'unicode'`` as ``encoding`` argument to serialise to a Unicode string.

The ``method`` argument defines the output method. It defaults to 'html', but can also be 'xml' for xhtml output, or 'text' to serialise to plain text without markup.

To leave out the tail text of the top-level element that is being serialised, pass ``with_tail=False``.

The ``doctype`` option allows passing in a plain string that will be serialised before the XML tree. Note that passing in non well-formed content here will make the XML output non well-formed. Also, an existing doctype in the document tree will not be removed when serialising an ElementTree instance.

Example::

    >>> from lxml import html
    >>> root = html.fragment_fromstring('<p>Hello<br>world!</p>')

    >>> html.tostring(root)
    b'<p>Hello<br>world!</p>'
    >>> html.tostring(root, method='html')
    b'<p>Hello<br>world!</p>'

    >>> html.tostring(root, method='xml')
    b'<p>Hello<br/>world!</p>'

    >>> html.tostring(root, method='text')
    b'Helloworld!'

    >>> html.tostring(root, method='text', encoding='unicode')
    u'Helloworld!'

    >>> root = html.fragment_fromstring('<div><p>Hello<br>world!</p>TAIL</div>')
    >>> html.tostring(root[0], method='text', encoding='unicode')
    u'Helloworld!TAIL'

    >>> html.tostring(root[0], method='text', encoding='unicode', with_tail=False)
    u'Helloworld!'

    >>> doc = html.document_fromstring('<p>Hello<br>world!</p>')
    >>> html.tostring(doc, method='html', encoding='unicode')
    u'<html><body><p>Hello<br>world!</p></body></html>'

    >>> print(html.tostring(doc, method='html', encoding='unicode',
    ...          doctype='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"'
    ...                  ' "http://www.w3.org/TR/html4/strict.dtd">'))
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html><body><p>Hello<br>world!</p></body></html>

def open_in_browser(doc, encoding=None): (source)

Open the HTML document in a web browser, saving it to a temporary file to open it. Note that this does not delete the file after use. This is mainly meant for debugging.

def Element(*args, **kw): (source)

Create a new HTML Element. This can also be used for XHTML documents.

html_parser = (source)

Undocumented

xhtml_parser = (source)

Undocumented