bs4.builder.TreeBuilder

class documentation

class TreeBuilder(object):

Known subclasses: bs4.builder._lxml.LXMLTreeBuilderForXML, bs4.builder.HTMLTreeBuilder, bs4.builder.SAXTreeBuilder

Turn a textual document into a Beautiful Soup object tree.

Method	`__init__`	Constructor.
Method	`can_be_empty_element`	Might a tag with this name be an empty-element tag?
Method	`feed`	Run some incoming markup through some parsing process, populating the `BeautifulSoup` object in self.soup.
Method	`initialize_soup`	The BeautifulSoup object has been initialized and is now being associated with the TreeBuilder.
Method	`prepare_markup`	Run any preliminary steps necessary to make incoming markup acceptable to the parser.
Method	`reset`	Do any work necessary to reset the underlying parser for a new document.
Method	`set_up_substitutions`	Set up any substitutions that will need to be performed on a `Tag` when it's output as a string.
Method	`test_fragment_to_document`	Wrap an HTML fragment to make it look like a document.
Constant	`ALTERNATE_NAMES`	Undocumented
Constant	`DEFAULT_CDATA_LIST_ATTRIBUTES`	Undocumented
Constant	`DEFAULT_PRESERVE_WHITESPACE_TAGS`	Undocumented
Constant	`DEFAULT_STRING_CONTAINERS`	Undocumented
Constant	`NAME`	Undocumented
Constant	`TRACKS_LINE_NUMBERS`	Undocumented
Constant	`USE_DEFAULT`	Undocumented
Class Variable	`empty_element_tags`	Undocumented
Class Variable	`features`	Undocumented
Class Variable	`is_xml`	Undocumented
Class Variable	`picklable`	Undocumented
Instance Variable	`cdata_list_attributes`	Undocumented
Instance Variable	`preserve_whitespace_tags`	Undocumented
Instance Variable	`soup`	Undocumented
Instance Variable	`store_line_numbers`	Undocumented
Instance Variable	`string_containers`	Undocumented
Method	`_replace_cdata_list_attribute_values`	When an attribute value is associated with a tag that can have multiple values for that attribute, convert the string value to a list of strings.

def __init__(self, multi_valued_attributes=USE_DEFAULT, preserve_whitespace_tags=USE_DEFAULT, store_line_numbers=USE_DEFAULT, string_containers=USE_DEFAULT):

overridden in bs4.builder._htmlparser.HTMLParserTreeBuilder, bs4.builder._lxml.LXMLTreeBuilderForXML

Constructor.

Parameters
`multi_valued_attributes`	If this is set to None, the TreeBuilder will not turn any values for attributes like 'class' into lists. Setting this to a dictionary will customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES for an example. Internally, these are called "CDATA list attributes", but that name is unlikely to mean much to an end user, so the argument is called `multi_valued_attributes`.
`preserve_whitespace_tags`	A list of tags to treat the way <pre> tags are treated in HTML. Tags in this list are immune from pretty-printing; their contents will always be output as-is.
`store_line_numbers`	If the parser keeps track of the line numbers and positions of the original markup, that information will, by default, be stored in each corresponding `Tag` object. You can turn this off by passing store_line_numbers=False. If the parser you're using doesn't keep track of this information, then setting store_line_numbers=True will do nothing.
`string_containers`	A dictionary mapping tag names to the classes that should be instantiated to contain the textual contents of those tags. The default is to use NavigableString for every tag, no matter what the name. You can override the default by changing DEFAULT_STRING_CONTAINERS.
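The constructor has to tell "argument not passed" apart from an explicit None, which is why the defaults are the USE_DEFAULT sentinel rather than None. The following is a minimal sketch of that sentinel pattern, not the library's code: `SketchBuilder` is a hypothetical stand-in class, and the contents of its DEFAULT_CDATA_LIST_ATTRIBUTES dictionary are made up for illustration.

```python
class SketchBuilder:
    """Toy stand-in for a tree builder (hypothetical, not part of bs4)."""

    # A unique object: no caller can pass this value by accident.
    USE_DEFAULT = object()

    # Illustrative default only; the real defaults live in the library.
    DEFAULT_CDATA_LIST_ATTRIBUTES = {"*": ["class"]}

    def __init__(self, multi_valued_attributes=USE_DEFAULT):
        # Only the sentinel triggers the class-level default.
        # An explicit None is kept and means "never split values".
        if multi_valued_attributes is self.USE_DEFAULT:
            multi_valued_attributes = self.DEFAULT_CDATA_LIST_ATTRIBUTES
        self.cdata_list_attributes = multi_valued_attributes


default_builder = SketchBuilder()
disabled_builder = SketchBuilder(multi_valued_attributes=None)
custom_builder = SketchBuilder(multi_valued_attributes={"a": ["rel"]})
```

With this pattern a subclass can change the default by overriding the class-level constant, while callers can still switch the feature off entirely by passing None.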

def can_be_empty_element(self, tag_name):

Might a tag with this name be an empty-element tag?

The final markup may or may not actually present this tag as self-closing.

For instance: an HTMLTreeBuilder does not consider a <p> tag to be an empty-element tag (it's not in HTMLTreeBuilder.empty_element_tags). This means an empty <p> tag will be presented as "<p></p>", not "<p/>" or "<p>".

The default implementation has no opinion about which tags are empty-element tags, so a tag will be presented as an empty-element tag if and only if it has no children. "<foo></foo>" will become "<foo/>", and "<foo>bar</foo>" will be left alone.

Parameters
`tag_name`	The name of a markup tag.
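The rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: `SketchBuilder`, `SketchHTMLBuilder`, and `render_empty` are hypothetical names, None is assumed to stand for "no opinion", and the set of HTML void tags is deliberately abbreviated.

```python
class SketchBuilder:
    """Hypothetical builder with no opinion about empty-element tags."""

    empty_element_tags = None  # assumed to mean "no opinion"

    def can_be_empty_element(self, tag_name):
        # With no opinion, any tag may be written as an empty-element tag.
        if self.empty_element_tags is None:
            return True
        return tag_name in self.empty_element_tags


class SketchHTMLBuilder(SketchBuilder):
    """Hypothetical HTML builder with a short, incomplete list of void tags."""

    empty_element_tags = {"br", "hr", "img"}


def render_empty(builder, tag_name):
    """Render a tag that has no children, according to the builder."""
    if builder.can_be_empty_element(tag_name):
        return "<%s/>" % tag_name
    return "<%s></%s>" % (tag_name, tag_name)
```

Here `render_empty(SketchBuilder(), "foo")` gives `<foo/>`, while the HTML variant keeps an empty `p` tag as an open and close pair because `p` is not in its set.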

def feed(self, markup):

overridden in bs4.builder._html5lib.HTML5TreeBuilder, bs4.builder._htmlparser.HTMLParserTreeBuilder, bs4.builder._lxml.LXMLTreeBuilder, bs4.builder._lxml.LXMLTreeBuilderForXML, bs4.builder.SAXTreeBuilder

Run some incoming markup through some parsing process, populating the `BeautifulSoup` object in self.soup.

This method is not implemented in TreeBuilder; it must be implemented in subclasses.

Returns
None.

def initialize_soup(self, soup):

overridden in bs4.builder._lxml.LXMLTreeBuilderForXML

The BeautifulSoup object has been initialized and is now being associated with the TreeBuilder.

Parameters
`soup`	A BeautifulSoup object.
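The contract between `initialize_soup` and `feed` can be illustrated with a toy pair of classes. Everything here is hypothetical: `SketchBuilder` and `WordBuilder` are not part of the library, a plain list stands in for the `BeautifulSoup` object, and "parsing" is reduced to splitting on whitespace.

```python
class SketchBuilder:
    """Hypothetical base class mirroring the feed/initialize_soup contract."""

    def __init__(self):
        self.soup = None

    def initialize_soup(self, soup):
        # Remember the object that feed() is expected to populate.
        self.soup = soup

    def feed(self, markup):
        # The base class has no parser; subclasses must provide one.
        raise NotImplementedError()


class WordBuilder(SketchBuilder):
    """Hypothetical subclass: 'parses' by appending each word to self.soup."""

    def feed(self, markup):
        # Populates self.soup in place and returns None.
        self.soup.extend(markup.split())


builder = WordBuilder()
builder.initialize_soup([])  # a list stands in for the soup object
result = builder.feed("hello tree builder")
```

After the call, `builder.soup` holds the three words and `result` is None, matching the "populate self.soup, return nothing" shape described above.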

def prepare_markup(self, markup, user_specified_encoding=None, document_declared_encoding=None, exclude_encodings=None):

overridden in bs4.builder._html5lib.HTML5TreeBuilder, bs4.builder._htmlparser.HTMLParserTreeBuilder, bs4.builder._lxml.LXMLTreeBuilderForXML

Run any preliminary steps necessary to make incoming markup acceptable to the parser.

Parameters
`markup`	Some markup, probably a bytestring.
`user_specified_encoding`	The user asked to try this encoding.
`document_declared_encoding`	The markup itself claims to be in this encoding. NOTE: This argument is not used by the calling code and can probably be removed.
`exclude_encodings`	The user asked _not_ to try any of these encodings.

Yields
A series of 4-tuples: (markup, encoding, declared encoding, has undergone character replacement).

Each 4-tuple represents a strategy for converting the document to Unicode and parsing it. Each strategy will be tried in turn.

By default, the only strategy is to parse the markup as-is. See `LXMLTreeBuilderForXML` and `HTMLParserTreeBuilder` for implementations that take into account the quirks of particular parsers.
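The "series of strategies" idea can be sketched with plain generators. This is not the library's code: the function names, the fixed candidate list of encodings, and the choice to decode inside the generator are all assumptions made for illustration.

```python
def prepare_markup_default(markup):
    """Yield the single default strategy: hand the markup over as-is."""
    yield (markup, None, None, False)


def prepare_markup_with_fallbacks(markup, user_specified_encoding=None,
                                  exclude_encodings=None):
    """Hypothetical variant: yield one strategy per encoding that decodes."""
    excluded = {name.lower() for name in (exclude_encodings or [])}
    seen = set()
    # Illustrative candidate order: the user's choice first, then fallbacks.
    for encoding in (user_specified_encoding, "utf-8", "latin-1"):
        if encoding is None:
            continue
        key = encoding.lower()
        if key in excluded or key in seen:
            continue
        seen.add(key)
        try:
            decoded = markup.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # this strategy cannot work; move on to the next
        yield (decoded, encoding, None, False)


def first_strategy(strategies):
    """Return the first strategy on offer, or None if there is none."""
    for strategy in strategies:
        return strategy
    return None
```

A caller would normally try each yielded tuple in turn until the parser accepts one; `first_strategy` only shows the simplest possible consumer.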

def reset(self):

Do any work necessary to reset the underlying parser for a new document. By default, this does nothing.

def set_up_substitutions(self, tag):

overridden in bs4.builder.HTMLTreeBuilder

Set up any substitutions that will need to be performed on a `Tag` when it's output as a string.

By default, this does nothing. See `HTMLTreeBuilder` for a case where this is used.

Parameters
`tag`	A `Tag`.

Returns
Whether or not a substitution was performed.

def test_fragment_to_document(self, fragment):

overridden in bs4.builder._html5lib.HTML5TreeBuilder, bs4.builder._lxml.LXMLTreeBuilder, bs4.builder._lxml.LXMLTreeBuilderForXML

Wrap an HTML fragment to make it look like a document.

Different parsers do this differently. For instance, lxml introduces an empty <head> tag, and html5lib doesn't. Abstracting this away lets us write simple tests which run HTML fragments through the parser and compare the results against other HTML fragments.

This method should not be used outside of tests.

Parameters
`fragment`	A string: a fragment of HTML.

Returns
A string: a full HTML document.

ALTERNATE_NAMES: list

overridden in bs4.builder._lxml.LXMLTreeBuilder, bs4.builder._lxml.LXMLTreeBuilderForXML

Undocumented

Value

[]

DEFAULT_CDATA_LIST_ATTRIBUTES

overridden in bs4.builder.HTMLTreeBuilder

Undocumented

Value

defaultdict(list)

DEFAULT_PRESERVE_WHITESPACE_TAGS

overridden in bs4.builder.HTMLTreeBuilder

Undocumented

Value

set()

DEFAULT_STRING_CONTAINERS: dict

overridden in bs4.builder.HTMLTreeBuilder

Undocumented

Value

{}

NAME: str

overridden in bs4.builder._html5lib.HTML5TreeBuilder, bs4.builder._lxml.LXMLTreeBuilderForXML

Undocumented

Value

'[Unknown tree builder]'

TRACKS_LINE_NUMBERS: bool

overridden in bs4.builder._html5lib.HTML5TreeBuilder, bs4.builder._htmlparser.HTMLParserTreeBuilder

Undocumented

Value

False

USE_DEFAULT

Undocumented

Value

object()

empty_element_tags

overridden in bs4.builder._lxml.LXMLTreeBuilderForXML, bs4.builder.HTMLTreeBuilder

Undocumented

features: list

overridden in bs4.builder._html5lib.HTML5TreeBuilder, bs4.builder._htmlparser.HTMLParserTreeBuilder, bs4.builder._lxml.LXMLTreeBuilder, bs4.builder._lxml.LXMLTreeBuilderForXML

Undocumented

is_xml: bool

overridden in bs4.builder._htmlparser.HTMLParserTreeBuilder, bs4.builder._lxml.LXMLTreeBuilder, bs4.builder._lxml.LXMLTreeBuilderForXML

Undocumented

picklable: bool

overridden in bs4.builder._htmlparser.HTMLParserTreeBuilder

Undocumented

cdata_list_attributes

Undocumented

preserve_whitespace_tags

Undocumented

soup

overridden in bs4.builder._lxml.LXMLTreeBuilderForXML

Undocumented

store_line_numbers

Undocumented

string_containers

Undocumented

def _replace_cdata_list_attribute_values(self, tag_name, attrs):

When an attribute value is associated with a tag that can have multiple values for that attribute, convert the string value to a list of strings.

Basically, replaces class="foo bar" with class=["foo", "bar"].

NOTE: This method modifies its input in place.

Parameters
`tag_name`	The name of a tag.
`attrs`	A dictionary containing the tag's attributes. Any appropriate attribute values will be modified in place.
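The conversion can be sketched as a standalone function. This is an illustration, not the method's actual body: `replace_multi_valued` is a hypothetical name, the mapping shape (tag name to attribute names, with "*" assumed to mean "any tag") is an assumption, and whitespace splitting is done with `str.split`.

```python
def replace_multi_valued(tag_name, attrs, multi_valued):
    """Split multi-valued attribute strings into lists, in place.

    multi_valued is assumed to map a tag name (or "*" for every tag)
    to the attribute names that may hold several values.
    """
    if not attrs or not multi_valued:
        return attrs  # nothing to do, or splitting is disabled
    universal = multi_valued.get("*", [])
    tag_specific = multi_valued.get(tag_name.lower(), [])
    # Iterate over a copy of the items because attrs is mutated below.
    for name, value in list(attrs.items()):
        if name in universal or name in tag_specific:
            if isinstance(value, str):
                attrs[name] = value.split()
    return attrs


attrs = {"class": "foo bar", "id": "x y"}
result = replace_multi_valued("p", attrs, {"*": ["class"]})
```

Only `class` is split here; `id` keeps its string value because it is not listed as multi-valued, and `result` is the same dictionary object as `attrs`.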