bs4.BeautifulSoup

class documentation

class BeautifulSoup(Tag): (source)

Known subclasses: bs4.BeautifulStoneSoup

A data structure representing a parsed HTML or XML document.

Most of the methods you'll call on a BeautifulSoup object are inherited from PageElement or Tag.

Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers. To write a new tree builder, you'll need to understand these methods as a whole.

These methods will be called by the BeautifulSoup constructor:

* reset()
* feed(markup)

The tree builder may call these methods from its feed() implementation:

* handle_starttag(name, attrs) # See note about return value
* handle_endtag(name)
* handle_data(data) # Appends to the current data node
* endData(containerClass) # Ends the current data node

No matter how complicated the underlying parser is, you should be able to build a tree using 'start tag' events, 'end tag' events, 'data' events, and "done with data" events.

If you encounter an empty-element tag (aka a self-closing tag, like HTML's <br> tag), call handle_starttag and then handle_endtag.
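The event model described above can be illustrated without touching Beautiful Soup's internals. The following is a minimal sketch that uses the standard library's html.parser.HTMLParser as the event source. The names Node, ToyTreeBuilder, VOID_TAGS, and end_data are hypothetical stand-ins invented for this example; they are not bs4 classes or methods, and the real implementation handles many more cases.

```python
from html.parser import HTMLParser


class Node:
    """A toy tree node; a stand-in for bs4's Tag, not the real class."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = []  # child Nodes and plain strings


class ToyTreeBuilder(HTMLParser):
    """Builds a Node tree from start-tag, end-tag, and data events."""

    VOID_TAGS = {"br", "hr", "img"}  # small illustrative subset

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.stack = [self.root]  # open tags, innermost last
        self.pending = []         # text chunks of the current data node

    def end_data(self):
        # "Done with data": flush the buffered chunks as one string.
        if self.pending:
            self.stack[-1].children.append("".join(self.pending))
            self.pending = []

    def handle_starttag(self, name, attrs):
        self.end_data()
        node = Node(name, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)
        if name in self.VOID_TAGS:
            # Empty-element tag: a start event followed at once by an end event.
            self.handle_endtag(name)

    def handle_endtag(self, name):
        self.end_data()
        # Pop up to and including the most recent open tag with this name.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == name:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.pending.append(data)  # appends to the current data node


builder = ToyTreeBuilder()
builder.feed("<p>Hello<br>world</p>")
builder.close()
builder.end_data()

paragraph = builder.root.children[0]
child_summary = [c if isinstance(c, str) else c.name for c in paragraph.children]
print(paragraph.name, child_summary)
```

The sketch keeps a stack of open tags and a buffer for the current data node, which is the same division of labor the four event methods above suggest.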

Method	`__copy__`	Copy a BeautifulSoup object by converting the document to a string and parsing it again.
Method	`__getstate__`	Undocumented
Method	`__init__`	Constructor.
Method	`decode`	Returns a string or Unicode representation of the parse tree as an HTML or XML document.
Method	`endData`	Method called by the TreeBuilder when the end of a data segment occurs.
Method	`handle_data`	Called by the tree builder when a chunk of textual data is encountered.
Method	`handle_endtag`	Called by the tree builder when an ending tag is encountered.
Method	`handle_starttag`	Called by the tree builder when a new tag is encountered.
Method	`insert_after`	This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree.
Method	`insert_before`	This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree.
Method	`new_string`	Create a new NavigableString associated with this BeautifulSoup object.
Method	`new_tag`	Create a new Tag associated with this BeautifulSoup object.
Method	`object_was_parsed`	Method called by the TreeBuilder to integrate an object into the parse tree.
Method	`popTag`	Internal method called by _popToTag when a tag is closed.
Method	`pushTag`	Internal method called by handle_starttag when a tag is opened.
Method	`reset`	Reset this object to a state as though it had never parsed any markup.
Method	`string_container`	Undocumented
Constant	`ASCII_SPACES`	Undocumented
Constant	`DEFAULT_BUILDER_FEATURES`	Undocumented
Constant	`NO_PARSER_SPECIFIED_WARNING`	Undocumented
Constant	`ROOT_TAG_NAME`	Undocumented
Instance Variable	`builder`	Undocumented
Instance Variable	`current_data`	Undocumented
Instance Variable	`currentTag`	Undocumented
Instance Variable	`element_classes`	Undocumented
Instance Variable	`hidden`	Undocumented
Instance Variable	`is_xml`	Undocumented
Instance Variable	`known_xml`	Undocumented
Instance Variable	`markup`	Undocumented
Instance Variable	`open_tag_counter`	Undocumented
Instance Variable	`parse_only`	Undocumented
Instance Variable	`preserve_whitespace_tag_stack`	Undocumented
Instance Variable	`string_container_stack`	Undocumented
Instance Variable	`tagStack`	Undocumented
Class Method	`_decode_markup`	Ensure `markup` is a string (decoding it if it is bytes) so it's safe to send into warnings.warn.
Class Method	`_markup_is_url`	Error-handling method to raise a warning if incoming markup looks like a URL.
Class Method	`_markup_resembles_filename`	Error-handling method to raise a warning if incoming markup resembles a filename.
Method	`_feed`	Internal method that parses previously set markup, creating a large number of Tag and NavigableString objects.
Method	`_linkage_fixer`	Make sure linkage of this fragment is sound.
Method	`_popToTag`	Pops the tag stack up to and including the most recent instance of the given tag.
Instance Variable	`_most_recent_element`	Undocumented
Instance Variable	`_namespaces`	Undocumented

Inherited from Tag:

Method	`__bool__`	A tag is non-None even if it has no contents.
Method	`__call__`	Calling a Tag like a function is the same as calling its find_all() method. E.g., tag('a') returns a list of all the A tags found within this tag.
Method	`__contains__`	Undocumented
Method	`__delitem__`	Deleting tag[key] deletes all 'key' attributes for the tag.
Method	`__eq__`	Returns true iff this Tag has the same name, the same attributes, and the same contents (recursively) as `other`.
Method	`__getattr__`	Calling tag.subtag is the same as calling tag.find(name="subtag")
Method	`__getitem__`	tag[key] returns the value of the 'key' attribute for the Tag, and throws an exception if it's not there.
Method	`__hash__`	Undocumented
Method	`__iter__`	Iterating over a Tag iterates over its contents.
Method	`__len__`	The length of a Tag is the length of its list of contents.
Method	`__ne__`	Returns true iff this Tag is not identical to `other`, as defined in __eq__.
Method	`__repr__`	Renders this PageElement as a string.
Method	`__setitem__`	Setting tag[key] sets the value of the 'key' attribute for the tag.
Method	`__unicode__`	Renders this PageElement as a Unicode string.
Method	`childGenerator`	Deprecated generator.
Method	`clear`	Wipe out all children of this PageElement by calling extract() on them.
Method	`decode_contents`	Renders the contents of this tag as a Unicode string.
Method	`decompose`	Recursively destroys this PageElement and its children.
Method	`encode`	Render a bytestring representation of this PageElement and its contents.
Method	`encode_contents`	Renders the contents of this PageElement as a bytestring.
Method	`find`	Look in the children of this PageElement and find the first PageElement that matches the given criteria.
Method	`find_all`	Look in the children of this PageElement and find all PageElements that match the given criteria.
Method	`get`	Returns the value of the 'key' attribute for the tag, or the value given for 'default' if it doesn't have that attribute.
Method	`get_attribute_list`	The same as get(), but always returns a list.
Method	`has_attr`	Does this PageElement have an attribute with the given name?
Method	`has_key`	Deprecated method. This was kind of misleading because has_key() (attributes) was different from __contains__ (contents).
Method	`index`	Find the index of a child by identity, not value.
Method	`prettify`	Pretty-print this PageElement as a string.
Method	`recursiveChildGenerator`	Deprecated generator.
Method	`renderContents`	Deprecated method for BS3 compatibility.
Method	`select`	Perform a CSS selection operation on the current element.
Method	`select_one`	Perform a CSS selection operation on the current element.
Method	`smooth`	Smooth out this element's children by consolidating consecutive strings.
Method	`string.setter`	Replace this PageElement's contents with `string`.
Constant	`DEFAULT_INTERESTING_STRING_TYPES`	Undocumented
Class Variable	`parserClass`	Undocumented
Class Variable	`strings`	Undocumented
Instance Variable	`attrs`	Undocumented
Instance Variable	`can_be_empty_element`	Undocumented
Instance Variable	`cdata_list_attributes`	Undocumented
Instance Variable	`contents`	Undocumented
Instance Variable	`interesting_string_types`	Undocumented
Instance Variable	`name`	Undocumented
Instance Variable	`namespace`	Undocumented
Instance Variable	`parser_class`	Undocumented
Instance Variable	`prefix`	Undocumented
Instance Variable	`preserve_whitespace_tags`	Undocumented
Instance Variable	`sourceline`	Undocumented
Instance Variable	`sourcepos`	Undocumented
Property	`children`	Iterate over all direct children of this PageElement.
Property	`descendants`	Iterate over all children of this PageElement in a depth-first (document order) sequence.
Property	`is_empty_element`	Is this tag an empty-element tag? (aka a self-closing tag)
Property	`string`	Convenience property to get the single string within this PageElement.
Method	`_all_strings`	Yield all strings of certain classes, possibly stripping them.
Method	`_should_pretty_print`	Should this tag be pretty-printed?

Inherited from PageElement (via Tag):

Method	`append`	Appends the given PageElement to the contents of this one.
Method	`extend`	Appends the given PageElements to this one's contents.
Method	`extract`	Destructively rips this element out of the tree.
Method	`find_all_next`	Find all PageElements that match the given criteria and appear later in the document than this PageElement.
Method	`find_all_previous`	Look backwards in the document from this PageElement and find all PageElements that match the given criteria.
Method	`find_next`	Find the first PageElement that matches the given criteria and appears later in the document than this PageElement.
Method	`find_next_sibling`	Find the closest sibling to this PageElement that matches the given criteria and appears later in the document.
Method	`find_next_siblings`	Find all siblings of this PageElement that match the given criteria and appear later in the document.
Method	`find_parent`	Find the closest parent of this PageElement that matches the given criteria.
Method	`find_parents`	Find all parents of this PageElement that match the given criteria.
Method	`find_previous`	Look backwards in the document from this PageElement and find the first PageElement that matches the given criteria.
Method	`find_previous_sibling`	Returns the closest sibling to this PageElement that matches the given criteria and appears earlier in the document.
Method	`find_previous_siblings`	Returns all siblings to this PageElement that match the given criteria and appear earlier in the document.
Method	`format_string`	Format the given string using the given formatter.
Method	`formatter_for_name`	Look up or create a Formatter for the given identifier, if necessary.
Method	`get_text`	Get all child strings of this PageElement, concatenated using the given separator.
Method	`insert`	Insert a new PageElement in the list of this PageElement's children.
Method	`nextGenerator`	Undocumented
Method	`nextSiblingGenerator`	Undocumented
Method	`parentGenerator`	Undocumented
Method	`previousGenerator`	Undocumented
Method	`previousSiblingGenerator`	Undocumented
Method	`replace_with`	Replace this PageElement with one or more PageElements, keeping the rest of the tree the same.
Method	`setup`	Sets up the initial relations between this element and other elements.
Method	`unwrap`	Replace this PageElement with its contents.
Method	`wrap`	Wrap this PageElement inside another one.
Class Variable	`default`	Undocumented
Class Variable	`nextSibling`	Undocumented
Class Variable	`previousSibling`	Undocumented
Class Variable	`text`	Undocumented
Instance Variable	`next_element`	Undocumented
Instance Variable	`next_sibling`	Undocumented
Instance Variable	`parent`	Undocumented
Instance Variable	`previous_element`	Undocumented
Instance Variable	`previous_sibling`	Undocumented
Property	`decomposed`	Check whether a PageElement has been decomposed.
Property	`next`	The PageElement, if any, that was parsed just after this one.
Property	`next_elements`	All PageElements that were parsed after this one.
Property	`next_siblings`	All PageElements that are siblings of this one but were parsed later.
Property	`parents`	All PageElements that are parents of this PageElement.
Property	`previous`	The PageElement, if any, that was parsed just before this one.
Property	`previous_elements`	All PageElements that were parsed before this one.
Property	`previous_siblings`	All PageElements that are siblings of this one but were parsed earlier.
Property	`stripped_strings`	Yield all strings in this PageElement, stripping them first.
Method	`_find_all`	Iterates over a generator looking for things that match.
Method	`_find_one`	Undocumented
Method	`_last_descendant`	Finds the last element beneath this object to be parsed.
Property	`_is_xml`	Is this element part of an XML tree or an HTML tree?

def __copy__(self): (source) ¶

overrides bs4.element.Tag.__copy__

Copy a BeautifulSoup object by converting the document to a string and parsing it again.
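A short sketch of what this means in practice, assuming the standard copy module is used to trigger `__copy__` and that the built-in "html.parser" builder is available. The copy is a separate tree, so changing it leaves the original untouched.

```python
import copy

from bs4 import BeautifulSoup

original = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
duplicate = copy.copy(original)  # invokes BeautifulSoup.__copy__

# Both objects render to the same markup right after copying.
same_markup = str(duplicate) == str(original)

# Editing the copy does not affect the original document.
duplicate.b.string = "there"
print(original.b.string)
print(duplicate.b.string)
```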

def __getstate__(self): (source) ¶

Undocumented

def __init__(self, markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs): (source) ¶

overrides bs4.element.Tag.__init__

overridden in bs4.BeautifulStoneSoup

Constructor.

:param markup: A string or a file-like object representing markup to be parsed.
:param features: Desirable features of the parser to be used. This may be the name of a specific parser ("lxml", "lxml-xml", "html.parser", or "html5lib") or it may be the type of markup to be used ("html", "html5", "xml"). It's recommended that you name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments.
:param builder: A TreeBuilder subclass to instantiate (or instance to use) instead of looking one up based on `features`. You only need to use this if you've implemented a custom TreeBuilder.
:param parse_only: A SoupStrainer. Only parts of the document matching the SoupStrainer will be considered. This is useful when parsing part of a document that would otherwise be too large to fit into memory.
:param from_encoding: A string indicating the encoding of the document to be parsed. Pass this in if Beautiful Soup is guessing wrongly about the document's encoding.
:param exclude_encodings: A list of strings indicating encodings known to be wrong. Pass this in if you don't know the document's encoding but you know Beautiful Soup's guess is wrong.
:param element_classes: A dictionary mapping BeautifulSoup classes like Tag and NavigableString, to other classes you'd like to be instantiated instead as the parse tree is built. This is useful for subclassing Tag or NavigableString to modify default behavior.
:param kwargs: For backwards compatibility purposes, the constructor accepts certain keyword arguments used in Beautiful Soup 3. None of these arguments do anything in Beautiful Soup 4; they will result in a warning and then be ignored. Apart from this, any keyword arguments passed into the BeautifulSoup constructor are propagated to the TreeBuilder constructor. This makes it possible to configure a TreeBuilder by passing in arguments, not just by saying which one to use.
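The following sketch exercises three of the parameters documented above: `features`, `parse_only`, and `from_encoding`. It assumes the built-in "html.parser" builder so that no third-party parser is needed; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer

markup = (
    '<html><body><a href="/one">One</a>'
    '<p>Text</p><a href="/two">Two</a></body></html>'
)

# features: name a specific parser so results match across environments.
soup = BeautifulSoup(markup, features="html.parser")
print(soup.p.get_text())

# parse_only: keep only the parts of the document matching the SoupStrainer.
only_links = BeautifulSoup(markup, "html.parser", parse_only=SoupStrainer("a"))
hrefs = [a["href"] for a in only_links.find_all("a")]
print(hrefs)

# from_encoding: state the encoding of a bytestring instead of relying on a guess.
latin1_bytes = b"<p>caf\xe9</p>"
decoded = BeautifulSoup(latin1_bytes, "html.parser", from_encoding="latin-1")
print(decoded.p.string)
```

Passing the parser name explicitly also avoids the warning stored in NO_PARSER_SPECIFIED_WARNING.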

def decode(self, pretty_print=False, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal'): (source) ¶

overrides bs4.element.Tag.decode

Returns a string or Unicode representation of the parse tree as an HTML or XML document.

:param pretty_print: If this is True, indentation will be used to make the document more readable.
:param eventual_encoding: The encoding of the final document. If this is None, the document will be a Unicode string.
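A brief sketch of the two rendering modes, assuming the signature shown above (with `pretty_print` as a keyword argument) and the built-in "html.parser" builder. Other releases of the library may name the first argument differently, so treat the keyword as tied to this version of the documentation.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# Default: the document as one string with no added whitespace.
compact = soup.decode()

# pretty_print=True: line breaks and indentation are added for readability.
indented = soup.decode(pretty_print=True)

print(compact)
print(indented)
```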

def endData(self, containerClass=None): (source) ¶

Method called by the TreeBuilder when the end of a data segment occurs.

def handle_data(self, data): (source) ¶

Called by the tree builder when a chunk of textual data is encountered.

def handle_endtag(self, name, nsprefix=None): (source) ¶

Called by the tree builder when an ending tag is encountered.

:param name: Name of the tag.
:param nsprefix: Namespace prefix for the tag.

def handle_starttag(self, name, namespace, nsprefix, attrs, sourceline=None, sourcepos=None, namespaces=None): (source) ¶

Called by the tree builder when a new tag is encountered.

:param name: Name of the tag.
:param nsprefix: Namespace prefix for the tag.
:param attrs: A dictionary of attribute values.
:param sourceline: The line number where this tag was found in its source document.
:param sourcepos: The character position within `sourceline` where this tag was found.
:param namespaces: A dictionary of all namespace prefix mappings currently in scope in the document.

If this method returns None, the tag was rejected by an active SoupStrainer. You should proceed as if the tag had not occurred in the document. For instance, if this was a self-closing tag, don't call handle_endtag.

def insert_after(self, *args): (source) ¶

overrides bs4.element.PageElement.insert_after

This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree.

def insert_before(self, *args): (source) ¶

overrides bs4.element.PageElement.insert_before

This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree.
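As a small illustration of the behavior described for insert_after and insert_before, the sketch below assumes that the unimplemented methods signal this by raising NotImplementedError, and that the same call works on an ordinary element inside the tree. The flag name root_insert_failed is made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")
rule = soup.new_tag("hr")

try:
    # The BeautifulSoup object is the root: nothing can sit before or after it.
    soup.insert_before(rule)
    root_insert_failed = False
except NotImplementedError:
    root_insert_failed = True

# Elements inside the tree do support the call.
soup.p.insert_before(rule)
print(soup)
```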

def new_string(self, s, subclass=None): (source) ¶

Create a new NavigableString associated with this BeautifulSoup object.
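A minimal sketch of both call forms, assuming the built-in "html.parser" builder. The optional `subclass` argument is shown here with Comment, one of the NavigableString subclasses in bs4.element.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<p></p>", "html.parser")

text = soup.new_string("Hello")                # a plain NavigableString
note = soup.new_string(" a remark ", Comment)  # a specific string subclass

# New strings are not part of the tree until they are attached.
soup.p.append(text)
soup.p.append(note)
print(soup)
```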

def new_tag(self, name, namespace=None, nsprefix=None, attrs={}, sourceline=None, sourcepos=None, **kwattrs): (source) ¶

Create a new Tag associated with this BeautifulSoup object.

:param name: The name of the new Tag.
:param namespace: The URI of the new Tag's XML namespace, if any.
:param nsprefix: The prefix for the new Tag's XML namespace, if any.
:param attrs: A dictionary of this Tag's attribute values; can be used instead of `kwattrs` for attributes like 'class' that are reserved words in Python.
:param sourceline: The line number where this tag was (purportedly) found in its source document.
:param sourcepos: The character position within `sourceline` where this tag was (purportedly) found.
:param kwattrs: Keyword arguments for the new Tag's attribute values.
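The sketch below shows the two ways of supplying attributes: keyword arguments for ordinary names and the `attrs` dictionary for names such as 'class' that are reserved words in Python. The URL and attribute values are placeholders chosen for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>See </p>", "html.parser")

# href goes in as a keyword argument; 'class' must go through attrs.
link = soup.new_tag("a", href="https://example.com/", attrs={"class": "external"})
link.string = "the site"

# The new tag is not part of the tree until it is attached.
soup.p.append(link)
print(soup.p.get_text())
```

get_attribute_list() is a convenient way to read 'class' afterwards, since it returns a list whether the value is stored as a string or as a list.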

def object_was_parsed(self, o, parent=None, most_recent_element=None): (source) ¶

Method called by the TreeBuilder to integrate an object into the parse tree.

def popTag(self): (source) ¶

Internal method called by _popToTag when a tag is closed.

def pushTag(self, tag): (source) ¶

Internal method called by handle_starttag when a tag is opened.

def reset(self): (source) ¶

Reset this object to a state as though it had never parsed any markup.

def string_container(self, base_class=None): (source) ¶

Undocumented

ASCII_SPACES: str = (source) ¶

Undocumented

Value

' \n\t\f\r'

DEFAULT_BUILDER_FEATURES: list[str] = (source) ¶

Undocumented

Value

['html', 'fast']

NO_PARSER_SPECIFIED_WARNING: str = (source) ¶

Undocumented

Value

'''No parser was explicitly specified, so I\'m using the best available %(markup_type)s parser for this system ("%(parser)s"). This usually isn\'t a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument \'features="%(pa
...

ROOT_TAG_NAME: str = (source) ¶

Undocumented

Value

'[document]'

builder = (source) ¶

Undocumented

current_data: list = (source) ¶

Undocumented

currentTag = (source) ¶

Undocumented

element_classes = (source) ¶

Undocumented

hidden: int = (source) ¶

overrides bs4.element.Tag.hidden

Undocumented

is_xml = (source) ¶

Undocumented

known_xml = (source) ¶

overrides bs4.element.Tag.known_xml

Undocumented

markup = (source) ¶

Undocumented

open_tag_counter = (source) ¶

Undocumented

parse_only = (source) ¶

Undocumented

preserve_whitespace_tag_stack: list = (source) ¶

Undocumented

string_container_stack: list = (source) ¶

Undocumented

tagStack: list = (source) ¶

Undocumented

@classmethod
def _decode_markup(cls, markup): (source) ¶

Ensure `markup` is a string (decoding it if it is bytes) so it's safe to send into warnings.warn.

TODO: warnings.warn had this problem back in 2010 but it might not anymore.

@classmethod
def _markup_is_url(cls, markup): (source) ¶

Error-handling method to raise a warning if incoming markup looks like a URL.

:param markup: A string.
:return: Whether or not the markup resembles a URL closely enough to justify a warning.

@classmethod
def _markup_resembles_filename(cls, markup): (source) ¶

Error-handling method to raise a warning if incoming markup resembles a filename.

:param markup: A bytestring or string.
:return: Whether or not the markup resembles a filename closely enough to justify a warning.

def _feed(self): (source) ¶

Internal method that parses previously set markup, creating a large number of Tag and NavigableString objects.

def _linkage_fixer(self, el): (source) ¶

Make sure linkage of this fragment is sound.

def _popToTag(self, name, nsprefix=None, inclusivePop=True): (source) ¶

Pops the tag stack up to and including the most recent instance of the given tag. If there are no open tags with the given name, nothing will be popped.

:param name: Pop up to the most recent tag with this name.
:param nsprefix: The namespace prefix that goes with `name`.
:param inclusivePop: If this is false, pops the tag stack up to but *not* including the most recent instance of the given tag.

_most_recent_element = (source) ¶

Undocumented

_namespaces = (source) ¶

overrides bs4.element.Tag._namespaces

Undocumented