bs4.builder._htmlparser.BeautifulSoupHTMLParser

class documentation

class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):

A subclass of the Python standard library's HTMLParser class, which listens for HTMLParser events and translates them into calls to Beautiful Soup's tree construction API.
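To make the event model concrete, here is a minimal sketch that uses only the standard library's HTMLParser. The `EventRecorder` class and its `events` list are hypothetical names invented for this illustration. Where Beautiful Soup's subclass would build tree nodes, the sketch just records each callback it receives.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: logs parser events instead of building a tree."""

    def __init__(self):
        # convert_charrefs=False keeps character references as separate events
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


recorder = EventRecorder()
recorder.feed('<!DOCTYPE html><p class="x">Hi<!-- note --></p>')
recorder.close()
for event in recorder.events:
    print(event)
```

Each recorded tuple corresponds to one of the handler methods listed below; a tree-building subclass would react to the same calls by creating or closing objects.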

Method	`__init__`	Constructor.
Method	`handle_charref`	Handle a numeric character reference by converting it to the corresponding Unicode character and treating it as textual data.
Method	`handle_comment`	Handle an HTML comment.
Method	`handle_data`	Handle some textual data that shows up between tags.
Method	`handle_decl`	Handle a DOCTYPE declaration.
Method	`handle_endtag`	Handle a closing tag, e.g. '</tag>'
Method	`handle_entityref`	Handle a named entity reference by converting it to the corresponding Unicode character(s) and treating it as textual data.
Method	`handle_pi`	Handle a processing instruction.
Method	`handle_startendtag`	Handle an incoming empty-element tag.
Method	`handle_starttag`	Handle an opening tag, e.g. '<tag>'
Method	`unknown_decl`	Handle a declaration of unknown type -- probably a CDATA block.
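Constant	`IGNORE`	Value for `on_duplicate_attribute`: keep the earliest value encountered for a duplicated attribute.
Constant	`REPLACE`	Value for `on_duplicate_attribute`: replace earlier values with later ones (the default).
Instance Variable	`already_closed_empty_element`	Undocumented
Instance Variable	`on_duplicate_attribute`	The strategy for handling a tag that includes the same attribute more than once: REPLACE, IGNORE, or a callable.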

Inherited from DetectsXMLParsedAsHTML:

Class Method	`warn_if_markup_looks_like_xml`	Perform a check on some markup to see if it looks like XML that's not XHTML. If so, issue a warning.
Constant	`LOOKS_LIKE_HTML`	Undocumented
Constant	`LOOKS_LIKE_HTML_B`	Undocumented
Constant	`XML_PREFIX`	Undocumented
Constant	`XML_PREFIX_B`	Undocumented
Class Method	`_warn`	Issue a warning about XML being parsed as HTML.
Method	`_document_might_be_xml`	Call this method when encountering an XML declaration, or a "processing instruction" that might be an XML declaration.
Method	`_initialize_xml_detector`	Call this method before parsing a document.
Method	`_root_tag_encountered`	Call this when you encounter the document's root tag.
Instance Variable	`_first_processing_instruction`	Undocumented
Instance Variable	`_root_tag`	Undocumented

def __init__(self, *args, **kwargs):

Constructor.

:param on_duplicate_attribute: A strategy for what to do if a tag includes the same attribute more than once. Accepted values are: REPLACE (replace earlier values with later ones, the default), IGNORE (keep the earliest value encountered), or a callable. A callable must take three arguments: the dictionary of attributes already processed, the name of the duplicate attribute, and the most recent value encountered.
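The three strategies can be sketched without Beautiful Soup. The function `resolve_attributes` and the callable `accumulate` below are hypothetical names used only for illustration. They mirror the behavior described above and are not the library's own code. The constant values are those listed for IGNORE and REPLACE on this page.

```python
REPLACE = "replace"
IGNORE = "ignore"


def resolve_attributes(attrs, on_duplicate_attribute=None):
    """Collapse (name, value) pairs into a dict, applying the strategy."""
    attr_dict = {}
    for key, value in attrs:
        if key not in attr_dict:
            attr_dict[key] = value
        elif on_duplicate_attribute in (None, REPLACE):
            attr_dict[key] = value  # the later value wins (default)
        elif on_duplicate_attribute == IGNORE:
            pass  # the earliest value is kept
        else:
            # A callable receives the dict so far, the name, and the new value
            on_duplicate_attribute(attr_dict, key, value)
    return attr_dict


def accumulate(attributes_so_far, key, value):
    """Example callable: collect every value of a duplicated attribute."""
    if not isinstance(attributes_so_far[key], list):
        attributes_so_far[key] = [attributes_so_far[key]]
    attributes_so_far[key].append(value)


pairs = [("href", "http://url1/"), ("href", "http://url2/")]
print(resolve_attributes(pairs))              # {'href': 'http://url2/'}
print(resolve_attributes(pairs, IGNORE))      # {'href': 'http://url1/'}
print(resolve_attributes(pairs, accumulate))  # {'href': ['http://url1/', 'http://url2/']}
```

In Beautiful Soup itself the option is normally supplied as a keyword argument when the soup is built, for example `BeautifulSoup(markup, "html.parser", on_duplicate_attribute="ignore")`.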

def handle_charref(self, name):

Handle a numeric character reference by converting it to the corresponding Unicode character and treating it as textual data.

:param name: Character number, possibly in hexadecimal.
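The conversion step can be sketched as a small standalone function. `charref_to_text` is a hypothetical name, and the fallback to the Unicode replacement character for out-of-range numbers is an assumption made for this sketch. The real method may apply additional encoding heuristics that are not shown here.

```python
def charref_to_text(name):
    """Turn the body of a reference such as '233' or 'xE9' into text."""
    if name[:1] in ("x", "X"):
        codepoint = int(name[1:], 16)  # hexadecimal form, e.g. &#xE9;
    else:
        codepoint = int(name)          # decimal form, e.g. &#233;
    try:
        return chr(codepoint)
    except (ValueError, OverflowError):
        # Assumed fallback for numbers outside the Unicode range
        return "\N{REPLACEMENT CHARACTER}"


print(charref_to_text("233"))
print(charref_to_text("xE9"))
```

Both calls refer to the same code point, U+00E9; the string passed in is what HTMLParser supplies after stripping the leading `&#` and the trailing `;`.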

def handle_comment(self, data):

Handle an HTML comment.

:param data: The text of the comment.

def handle_data(self, data):

Handle some textual data that shows up between tags.

def handle_decl(self, data):

Handle a DOCTYPE declaration.

:param data: The text of the declaration.

def handle_endtag(self, name, check_already_closed=True):

Handle a closing tag, e.g. '</tag>'.

:param name: A tag name.
:param check_already_closed: True if this tag is expected to be the closing portion of an empty-element tag, e.g. '<tag></tag>'.

def handle_entityref(self, name):

Handle a named entity reference by converting it to the corresponding Unicode character(s) and treating it as textual data.

:param name: Name of the entity reference.
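A named reference can be resolved with the standard library's `html.entities.html5` table, as in the sketch below. `entityref_to_text` is a hypothetical name, Beautiful Soup may consult its own lookup table instead, and emitting the literal `&name` for an unknown entity is an assumption of this sketch.

```python
from html.entities import html5


def entityref_to_text(name):
    """Map an entity name such as 'amp' to the character(s) it stands for."""
    # Keys in the html5 table include the trailing semicolon
    text = html5.get(name + ";")
    if text is None:
        # Assumed fallback: treat an unknown name as literal text
        return "&" + name
    return text


print(entityref_to_text("amp"))
print(entityref_to_text("eacute"))
print(entityref_to_text("nosuchentity"))
```

Some names expand to more than one code point, which is why the description above says "character(s)"; `NotEqualTilde` is one such name.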

def handle_pi(self, data):

Handle a processing instruction.

:param data: The text of the instruction.

def handle_startendtag(self, name, attrs):

Handle an incoming empty-element tag. This is only called when the markup looks like <tag/>.

:param name: Name of the tag.
:param attrs: The tag's attributes, as the list of (name, value) pairs supplied by HTMLParser.
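The distinction between the two tag shapes is visible with the standard library parser alone. In this sketch the class `TagShapeRecorder` is a hypothetical name. It records which callback HTMLParser chooses for a self-closing tag and for the same tag written without the slash.

```python
from html.parser import HTMLParser


class TagShapeRecorder(HTMLParser):
    """Hypothetical helper: notes which handler each tag triggers."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, dict(attrs)))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startendtag", tag, dict(attrs)))


recorder = TagShapeRecorder()
recorder.feed('<img src="a.png"/><img src="b.png">')
recorder.close()
for event in recorder.events:
    print(event)
```

Only the first tag, written as `<img .../>`, reaches `handle_startendtag`; the second goes to `handle_starttag` even though `img` never has content.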

def handle_starttag(self, name, attrs, handle_empty_element=True):

Handle an opening tag, e.g. '<tag>'.

:param name: Name of the tag.
:param attrs: The tag's attributes, as the list of (name, value) pairs supplied by HTMLParser.
:param handle_empty_element: True if this tag is known to be an empty-element tag (i.e. there is not expected to be any closing tag).
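The following sketch shows one way `handle_empty_element`, `check_already_closed`, and `already_closed_empty_element` could fit together. It is an assumption-laden illustration, not the library's code: `VoidAwareParser` and `VOID_TAGS` are hypothetical names, and the set of void tags is a small illustrative subset. The idea is that an empty-element tag written as `<br>` is closed immediately, and a later redundant `</br>` is swallowed rather than reported twice.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input"}  # illustrative subset only


class VoidAwareParser(HTMLParser):
    """Hypothetical sketch of empty-element bookkeeping."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.already_closed_empty_element = []

    def handle_starttag(self, name, attrs, handle_empty_element=True):
        self.events.append(("start", name))
        if name in VOID_TAGS and handle_empty_element:
            # No end-tag event is expected, so close the tag right away
            self.handle_endtag(name, check_already_closed=False)
            # Remember it in case an explicit closing tag shows up later
            self.already_closed_empty_element.append(name)

    def handle_startendtag(self, name, attrs):
        # <tag/> form: open without auto-closing, then close explicitly
        self.handle_starttag(name, attrs, handle_empty_element=False)
        self.handle_endtag(name)

    def handle_endtag(self, name, check_already_closed=True):
        if check_already_closed and name in self.already_closed_empty_element:
            # Redundant closing tag: cross it off instead of reporting it
            self.already_closed_empty_element.remove(name)
        else:
            self.events.append(("end", name))


parser = VoidAwareParser()
parser.feed("<br></br><br/>")
parser.close()
print(parser.events)
```

Under these assumptions each `br` yields exactly one start and one end event, whether it was written as `<br></br>` or as `<br/>`.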

def unknown_decl(self, data):

Handle a declaration of unknown type -- probably a CDATA block.

:param data: The text of the declaration.
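A handler like this has to decide whether the declaration text is a CDATA block or something else. The sketch below assumes the parser passes the text found between `<![` and `]]>`, so a CDATA section arrives with a `CDATA[` prefix. `classify_unknown_decl` and its return labels are hypothetical and serve only to illustrate the check.

```python
def classify_unknown_decl(data):
    """Split declaration text into a (kind, payload) pair."""
    prefix = "CDATA["
    if data.upper().startswith(prefix):
        # Drop the marker so only the section's content remains
        return ("cdata", data[len(prefix):])
    return ("declaration", data)


print(classify_unknown_decl("CDATA[x < y"))
print(classify_unknown_decl("if gte mso 9"))
```

A tree builder could then create a different node type for each kind.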

IGNORE: str = 'ignore'

Value for on_duplicate_attribute: keep the earliest value encountered for a duplicated attribute.

REPLACE: str = 'replace'

Value for on_duplicate_attribute: replace earlier values with later ones. This is the default.

already_closed_empty_element: list

Undocumented

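on_duplicate_attribute

The strategy for handling a tag that includes the same attribute more than once, as passed to the constructor: REPLACE, IGNORE, or a callable taking the dictionary of attributes already processed, the name of the duplicate attribute, and the most recent value encountered.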