class documentation

class BeautifulSoup(Tag): (source)

Known subclasses: bs4.BeautifulStoneSoup

A data structure representing a parsed HTML or XML document.

Most of the methods you'll call on a BeautifulSoup object are inherited from PageElement or Tag.

Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers. To write a new tree builder, you'll need to understand these methods as a whole.

These methods will be called by the BeautifulSoup constructor:

* reset()
* feed(markup)

The tree builder may call these methods from its feed() implementation:

* handle_starttag(name, attrs) # See note about return value
* handle_endtag(name)
* handle_data(data) # Appends to the current data node
* endData(containerClass) # Ends the current data node

No matter how complicated the underlying parser is, you should be able to build a tree using 'start tag' events, 'end tag' events, 'data' events, and 'done with data' events.

If you encounter an empty-element tag (aka a self-closing tag, like HTML's <br> tag), call handle_starttag and then handle_endtag.
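The event model described above can be illustrated with a small toy, built only on the standard library's html.parser. This is a sketch, not Beautiful Soup's implementation: ToyNode, ToyTree, ToyBuilder and build are hypothetical names, and the method signatures are simplified compared with the real ones documented further down.

```python
from html.parser import HTMLParser


class ToyNode:
    """Hypothetical stand-in for a Tag: a name, attributes, and children."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = []  # ToyNode objects or plain strings


class ToyTree:
    """Hypothetical tree object exposing the four kinds of events."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.root = ToyNode("[document]")
        self.stack = [self.root]  # currently open nodes, root first
        self.current_data = []    # pieces of the current data node

    def handle_starttag(self, name, attrs):
        self.end_data()
        node = ToyNode(name, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)
        return node

    def handle_endtag(self, name):
        self.end_data()
        # Pop up to and including the most recent open node with this name.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == name:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.current_data.append(data)

    def end_data(self):
        if self.current_data:
            self.stack[-1].children.append("".join(self.current_data))
            self.current_data = []


class ToyBuilder(HTMLParser):
    """Hypothetical tree builder: forwards parser events to a ToyTree."""

    VOID = {"br", "hr", "img"}  # a few empty-element tags, for illustration

    def __init__(self, tree):
        super().__init__()
        self.tree = tree

    def handle_starttag(self, name, attrs):
        self.tree.handle_starttag(name, attrs)
        if name in self.VOID:
            # Empty-element tag: start tag event, then end tag event.
            self.tree.handle_endtag(name)

    def handle_endtag(self, name):
        self.tree.handle_endtag(name)

    def handle_data(self, data):
        self.tree.handle_data(data)


def build(markup):
    tree = ToyTree()
    builder = ToyBuilder(tree)
    builder.feed(markup)
    builder.close()
    tree.end_data()
    return tree.root
```

Calling build("<p>Hello<br>world</p>") yields a root whose single child is a "p" node holding the string "Hello", a "br" node, and the string "world".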

Method __copy__ Copy a BeautifulSoup object by converting the document to a string and parsing it again.
Method __getstate__ Undocumented
Method __init__ Constructor.
Method decode Returns a string or Unicode representation of the parse tree as an HTML or XML document.
Method endData Method called by the TreeBuilder when the end of a data segment occurs.
Method handle_data Called by the tree builder when a chunk of textual data is encountered.
Method handle_endtag Called by the tree builder when an ending tag is encountered.
Method handle_starttag Called by the tree builder when a new tag is encountered.
Method insert_after This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree.
Method insert_before This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree.
Method new_string Create a new NavigableString associated with this BeautifulSoup object.
Method new_tag Create a new Tag associated with this BeautifulSoup object.
Method object_was_parsed Method called by the TreeBuilder to integrate an object into the parse tree.
Method popTag Internal method called by _popToTag when a tag is closed.
Method pushTag Internal method called by handle_starttag when a tag is opened.
Method reset Reset this object to a state as though it had never parsed any markup.
Method string_container Undocumented
Constant ASCII_SPACES Undocumented
Constant DEFAULT_BUILDER_FEATURES Undocumented
Constant NO_PARSER_SPECIFIED_WARNING Undocumented
Constant ROOT_TAG_NAME Undocumented
Instance Variable builder Undocumented
Instance Variable current_data Undocumented
Instance Variable currentTag Undocumented
Instance Variable element_classes Undocumented
Instance Variable hidden Undocumented
Instance Variable is_xml Undocumented
Instance Variable known_xml Undocumented
Instance Variable markup Undocumented
Instance Variable open_tag_counter Undocumented
Instance Variable parse_only Undocumented
Instance Variable preserve_whitespace_tag_stack Undocumented
Instance Variable string_container_stack Undocumented
Instance Variable tagStack Undocumented
Class Method _decode_markup Ensure `markup` is bytes so it's safe to send into warnings.warn.
Class Method _markup_is_url Error-handling method to raise a warning if incoming markup looks like a URL.
Class Method _markup_resembles_filename Error-handling method to raise a warning if incoming markup resembles a filename.
Method _feed Internal method that parses previously set markup, creating a large number of Tag and NavigableString objects.
Method _linkage_fixer Make sure linkage of this fragment is sound.
Method _popToTag Pops the tag stack up to and including the most recent instance of the given tag.
Instance Variable _most_recent_element Undocumented
Instance Variable _namespaces Undocumented

Inherited from Tag:

Method __bool__ A tag is non-None even if it has no contents.
Method __call__ Calling a Tag like a function is the same as calling its find_all() method. For example, tag('a') returns a list of all the A tags found within this tag.
Method __contains__ Undocumented
Method __delitem__ Deleting tag[key] deletes all 'key' attributes for the tag.
Method __eq__ Returns true iff this Tag has the same name, the same attributes, and the same contents (recursively) as `other`.
Method __getattr__ Calling tag.subtag is the same as calling tag.find(name="subtag")
Method __getitem__ tag[key] returns the value of the 'key' attribute for the Tag, and throws an exception if it's not there.
Method __hash__ Undocumented
Method __iter__ Iterating over a Tag iterates over its contents.
Method __len__ The length of a Tag is the length of its list of contents.
Method __ne__ Returns true iff this Tag is not identical to `other`, as defined in __eq__.
Method __repr__ Renders this PageElement as a string.
Method __setitem__ Setting tag[key] sets the value of the 'key' attribute for the tag.
Method __unicode__ Renders this PageElement as a Unicode string.
Method childGenerator Deprecated generator.
Method clear Wipe out all children of this PageElement by calling extract() on them.
Method decode_contents Renders the contents of this tag as a Unicode string.
Method decompose Recursively destroys this PageElement and its children.
Method encode Render a bytestring representation of this PageElement and its contents.
Method encode_contents Renders the contents of this PageElement as a bytestring.
Method find Look in the children of this PageElement and find the first PageElement that matches the given criteria.
Method find_all Look in the children of this PageElement and find all PageElements that match the given criteria.
Method get Returns the value of the 'key' attribute for the tag, or the value given for 'default' if it doesn't have that attribute.
Method get_attribute_list The same as get(), but always returns a list.
Method has_attr Does this PageElement have an attribute with the given name?
Method has_key Deprecated method. This was kind of misleading because has_key() (attributes) was different from __contains__ (contents).
Method index Find the index of a child by identity, not value.
Method prettify Pretty-print this PageElement as a string.
Method recursiveChildGenerator Deprecated generator.
Method renderContents Deprecated method for BS3 compatibility.
Method select Perform a CSS selection operation on the current element.
Method select_one Perform a CSS selection operation on the current element.
Method smooth Smooth out this element's children by consolidating consecutive strings.
Method string.setter Replace this PageElement's contents with `string`.
Constant DEFAULT_INTERESTING_STRING_TYPES Undocumented
Class Variable parserClass Undocumented
Class Variable strings Undocumented
Instance Variable attrs Undocumented
Instance Variable can_be_empty_element Undocumented
Instance Variable cdata_list_attributes Undocumented
Instance Variable contents Undocumented
Instance Variable interesting_string_types Undocumented
Instance Variable name Undocumented
Instance Variable namespace Undocumented
Instance Variable parser_class Undocumented
Instance Variable prefix Undocumented
Instance Variable preserve_whitespace_tags Undocumented
Instance Variable sourceline Undocumented
Instance Variable sourcepos Undocumented
Property children Iterate over all direct children of this PageElement.
Property descendants Iterate over all descendants of this PageElement in document order (depth-first).
Property is_empty_element Is this tag an empty-element tag? (aka a self-closing tag)
Property string Convenience property to get the single string within this PageElement.
Method _all_strings Yield all strings of certain classes, possibly stripping them.
Method _should_pretty_print Should this tag be pretty-printed?

Inherited from PageElement (via Tag):

Method append Appends the given PageElement to the contents of this one.
Method extend Appends the given PageElements to this one's contents.
Method extract Destructively rips this element out of the tree.
Method find_all_next Find all PageElements that match the given criteria and appear later in the document than this PageElement.
Method find_all_previous Look backwards in the document from this PageElement and find all PageElements that match the given criteria.
Method find_next Find the first PageElement that matches the given criteria and appears later in the document than this PageElement.
Method find_next_sibling Find the closest sibling to this PageElement that matches the given criteria and appears later in the document.
Method find_next_siblings Find all siblings of this PageElement that match the given criteria and appear later in the document.
Method find_parent Find the closest parent of this PageElement that matches the given criteria.
Method find_parents Find all parents of this PageElement that match the given criteria.
Method find_previous Look backwards in the document from this PageElement and find the first PageElement that matches the given criteria.
Method find_previous_sibling Returns the closest sibling to this PageElement that matches the given criteria and appears earlier in the document.
Method find_previous_siblings Returns all siblings to this PageElement that match the given criteria and appear earlier in the document.
Method format_string Format the given string using the given formatter.
Method formatter_for_name Look up or create a Formatter for the given identifier, if necessary.
Method get_text Get all child strings of this PageElement, concatenated using the given separator.
Method insert Insert a new PageElement in the list of this PageElement's children.
Method nextGenerator Undocumented
Method nextSiblingGenerator Undocumented
Method parentGenerator Undocumented
Method previousGenerator Undocumented
Method previousSiblingGenerator Undocumented
Method replace_with Replace this PageElement with one or more PageElements, keeping the rest of the tree the same.
Method setup Sets up the initial relations between this element and other elements.
Method unwrap Replace this PageElement with its contents.
Method wrap Wrap this PageElement inside another one.
Class Variable default Undocumented
Class Variable nextSibling Undocumented
Class Variable previousSibling Undocumented
Class Variable text Undocumented
Instance Variable next_element Undocumented
Instance Variable next_sibling Undocumented
Instance Variable parent Undocumented
Instance Variable previous_element Undocumented
Instance Variable previous_sibling Undocumented
Property decomposed Check whether a PageElement has been decomposed.
Property next The PageElement, if any, that was parsed just after this one.
Property next_elements All PageElements that were parsed after this one.
Property next_siblings All PageElements that are siblings of this one but were parsed later.
Property parents All PageElements that are parents of this PageElement.
Property previous The PageElement, if any, that was parsed just before this one.
Property previous_elements All PageElements that were parsed before this one.
Property previous_siblings All PageElements that are siblings of this one but were parsed earlier.
Property stripped_strings Yield all strings in this PageElement, stripping them first.
Method _find_all Iterates over a generator looking for things that match.
Method _find_one Undocumented
Method _last_descendant Finds the last element beneath this object to be parsed.
Property _is_xml Is this element part of an XML tree or an HTML tree?
def __copy__(self): (source)

Copy a BeautifulSoup object by converting the document to a string and parsing it again.
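A short sketch of what copying means in practice, assuming the bs4 package is installed and using the standard copy module. The variable names are illustrative only.

```python
import copy

from bs4 import BeautifulSoup

original = BeautifulSoup("<p>Hello</p>", features="html.parser")

# copy.copy() dispatches to __copy__, which produces a separate parse tree.
duplicate = copy.copy(original)
duplicate.p.string = "Changed"

original_markup = str(original)
duplicate_markup = str(duplicate)
```

Editing the duplicate should leave the original document untouched, since the two objects do not share any Tag or NavigableString instances.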

def __getstate__(self): (source)

Undocumented

def __init__(self, markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs): (source)

Constructor.

:param markup: A string or a file-like object representing markup to be parsed.
:param features: Desirable features of the parser to be used. This may be the name of a specific parser ("lxml", "lxml-xml", "html.parser", or "html5lib") or it may be the type of markup to be used ("html", "html5", "xml"). It's recommended that you name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments.
:param builder: A TreeBuilder subclass to instantiate (or instance to use) instead of looking one up based on `features`. You only need to use this if you've implemented a custom TreeBuilder.
:param parse_only: A SoupStrainer. Only parts of the document matching the SoupStrainer will be considered. This is useful when parsing part of a document that would otherwise be too large to fit into memory.
:param from_encoding: A string indicating the encoding of the document to be parsed. Pass this in if Beautiful Soup is guessing wrongly about the document's encoding.
:param exclude_encodings: A list of strings indicating encodings known to be wrong. Pass this in if you don't know the document's encoding but you know Beautiful Soup's guess is wrong.
:param element_classes: A dictionary mapping BeautifulSoup classes like Tag and NavigableString, to other classes you'd like to be instantiated instead as the parse tree is built. This is useful for subclassing Tag or NavigableString to modify default behavior.
:param kwargs: For backwards compatibility purposes, the constructor accepts certain keyword arguments used in Beautiful Soup 3. None of these arguments do anything in Beautiful Soup 4; they will result in a warning and then be ignored. Apart from this, any keyword arguments passed into the BeautifulSoup constructor are propagated to the TreeBuilder constructor. This makes it possible to configure a TreeBuilder by passing in arguments, not just by saying which one to use.
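As a minimal sketch of the `features` and `parse_only` parameters, assuming the bs4 package is installed: the markup string and variable names below are made up for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer

markup = (
    "<html><body><p>Intro</p>"
    "<a href='/one'>One</a><a href='/two'>Two</a>"
    "</body></html>"
)

# Naming a specific parser keeps results the same across environments.
soup = BeautifulSoup(markup, features="html.parser")
links = [a["href"] for a in soup.find_all("a")]

# parse_only keeps only the parts of the document matching the SoupStrainer.
only_links = BeautifulSoup(
    markup, features="html.parser", parse_only=SoupStrainer("a")
)
strained_names = [tag.name for tag in only_links.find_all(True)]
```

In the second tree the <html>, <body> and <p> tags are never created, which is what makes `parse_only` useful for large documents.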

def decode(self, pretty_print=False, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal'): (source)

Returns a string or Unicode representation of the parse tree as an HTML or XML document.

:param pretty_print: If this is True, indentation will be used to make the document more readable.
:param eventual_encoding: The encoding of the final document. If this is None, the document will be a Unicode string.
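The `formatter` argument in the signature is not described above. The sketch below, which assumes the bs4 package is installed, compares the default "minimal" formatter with two other values Beautiful Soup accepts for formatters ("html" and None); the sample markup is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Café &amp; bar</p>", features="html.parser")

# Default formatter ("minimal"): escape only what is needed for valid markup.
minimal = soup.decode()

# "html" formatter: also convert characters to named HTML entities.
html_entities = soup.decode(formatter="html")

# formatter=None: leave strings exactly as stored in the tree (no escaping).
unescaped = soup.decode(formatter=None)
```

Note that output produced with formatter=None may not be valid HTML, because a bare ampersand is written out as-is.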

def endData(self, containerClass=None): (source)

Method called by the TreeBuilder when the end of a data segment occurs.

def handle_data(self, data): (source)

Called by the tree builder when a chunk of textual data is encountered.

def handle_endtag(self, name, nsprefix=None): (source)

Called by the tree builder when an ending tag is encountered.

:param name: Name of the tag.
:param nsprefix: Namespace prefix for the tag.

def handle_starttag(self, name, namespace, nsprefix, attrs, sourceline=None, sourcepos=None, namespaces=None): (source)

Called by the tree builder when a new tag is encountered.

:param name: Name of the tag.
:param nsprefix: Namespace prefix for the tag.
:param attrs: A dictionary of attribute values.
:param sourceline: The line number where this tag was found in its source document.
:param sourcepos: The character position within `sourceline` where this tag was found.
:param namespaces: A dictionary of all namespace prefix mappings currently in scope in the document.

If this method returns None, the tag was rejected by an active SoupStrainer. You should proceed as if the tag had not occurred in the document. For instance, if this was a self-closing tag, don't call handle_endtag.

def insert_after(self, *args): (source)

This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree.

def insert_before(self, *args): (source)

This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree.
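A brief sketch of what this looks like to a caller, assuming the bs4 package is installed and that the unimplemented methods signal this by raising NotImplementedError. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>First</p>", features="html.parser")
rule = soup.new_tag("hr")

try:
    soup.insert_after(rule)
    raised = False
except NotImplementedError:
    # The document root has no siblings, so there is nowhere to insert.
    raised = True

# Appending works, because it only adds a child to the document.
soup.append(rule)
result = str(soup)
```

To place content at the start or end of a document, use insert() or append() on the BeautifulSoup object instead.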

def new_string(self, s, subclass=None): (source)

Create a new NavigableString associated with this BeautifulSoup object.

def new_tag(self, name, namespace=None, nsprefix=None, attrs={}, sourceline=None, sourcepos=None, **kwattrs): (source)

Create a new Tag associated with this BeautifulSoup object.

:param name: The name of the new Tag.
:param namespace: The URI of the new Tag's XML namespace, if any.
:param nsprefix: The prefix for the new Tag's XML namespace, if any.
:param attrs: A dictionary of this Tag's attribute values; can be used instead of `kwattrs` for attributes like 'class' that are reserved words in Python.
:param sourceline: The line number where this tag was (purportedly) found in its source document.
:param sourcepos: The character position within `sourceline` where this tag was (purportedly) found.
:param kwattrs: Keyword arguments for the new Tag's attribute values.
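The following sketch shows `attrs` and `kwattrs` used together, along with new_string(), assuming the bs4 package is installed. The URL and attribute values are placeholders.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div></div>", features="html.parser")

# 'class' is a reserved word in Python, so it goes through attrs;
# ordinary attribute names such as href can be keyword arguments.
link = soup.new_tag("a", attrs={"class": "external"}, href="https://example.com/")

# new_string() creates a NavigableString tied to the same document.
link.append(soup.new_string("Example"))

# A new Tag is not part of the tree until it is attached somewhere.
soup.div.append(link)
```

After the final append, soup.div.a refers to the newly created tag.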

def object_was_parsed(self, o, parent=None, most_recent_element=None): (source)

Method called by the TreeBuilder to integrate an object into the parse tree.

def popTag(self): (source)

Internal method called by _popToTag when a tag is closed.

def pushTag(self, tag): (source)

Internal method called by handle_starttag when a tag is opened.

def reset(self): (source)

Reset this object to a state as though it had never parsed any markup.

def string_container(self, base_class=None): (source)

Undocumented

ASCII_SPACES: str = (source)

Undocumented

Value
''' 
\t\f\r'''
DEFAULT_BUILDER_FEATURES: list[str] = (source)

Undocumented

Value
['html', 'fast']
NO_PARSER_SPECIFIED_WARNING: str = (source)

Undocumented

Value
'''No parser was explicitly specified, so I\'m using the best available %(markup
_type)s parser for this system ("%(parser)s"). This usually isn\'t a problem, bu
t if you run this code on another system, or in a different virtual environment,
 it may use a different parser and behave differently.

The code that caused this warning is on line %(line_number)s of the file %(filen
ame)s. To get rid of this warning, pass the additional argument \'features="%(pa
...
ROOT_TAG_NAME: str = (source)

Undocumented

Value
'[document]'

builder = (source)

Undocumented

current_data: list = (source)

Undocumented

currentTag = (source)

Undocumented

element_classes = (source)

Undocumented

hidden = (source)

Undocumented

is_xml = (source)

Undocumented

known_xml = (source)

Undocumented

markup = (source)

Undocumented

open_tag_counter = (source)

Undocumented

parse_only = (source)

Undocumented

preserve_whitespace_tag_stack: list = (source)

Undocumented

string_container_stack: list = (source)

Undocumented

tagStack: list = (source)

Undocumented

@classmethod
def _decode_markup(cls, markup): (source)

Ensure `markup` is bytes so it's safe to send into warnings.warn. TODO: warnings.warn had this problem back in 2010 but it might not anymore.

@classmethod
def _markup_is_url(cls, markup): (source)

Error-handling method to raise a warning if incoming markup looks like a URL.

:param markup: A string.
:return: Whether or not the markup resembles a URL closely enough to justify a warning.

@classmethod
def _markup_resembles_filename(cls, markup): (source)

Error-handling method to raise a warning if incoming markup resembles a filename.

:param markup: A bytestring or string.
:return: Whether or not the markup resembles a filename closely enough to justify a warning.

def _feed(self): (source)

Internal method that parses previously set markup, creating a large number of Tag and NavigableString objects.

def _linkage_fixer(self, el): (source)

Make sure linkage of this fragment is sound.

def _popToTag(self, name, nsprefix=None, inclusivePop=True): (source)

Pops the tag stack up to and including the most recent instance of the given tag.

If there are no open tags with the given name, nothing will be popped.

:param name: Pop up to the most recent tag with this name.
:param nsprefix: The namespace prefix that goes with `name`.
:param inclusivePop: If this is false, pops the tag stack up to but *not* including the most recent instance of the given tag.
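The popping rule can be modelled on a plain list of tag names. This is a toy sketch of the behaviour described above, not the library's code: pop_to_tag is a hypothetical helper, it ignores namespace prefixes, and it returns the list of popped names purely for illustration.

```python
def pop_to_tag(stack, name, inclusive_pop=True):
    """Toy model: `stack` is a list of open tag names, root first."""
    if name not in stack[1:]:
        # No open tag with this name (the root itself never counts):
        # nothing is popped.
        return []
    popped = []
    # Close everything opened after the most recent `name`.
    while stack[-1] != name:
        popped.append(stack.pop())
    if inclusive_pop:
        # Also close the matching tag itself.
        popped.append(stack.pop())
    return popped


open_tags = ["[document]", "div", "p", "b"]
closed = pop_to_tag(open_tags, "p")
```

With the inclusive default, closing "p" also closes the "b" opened inside it and leaves only "[document]" and "div" open. Passing inclusive_pop=False would close "b" but leave "p" open.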

_most_recent_element = (source)

Undocumented

_namespaces = (source)

Undocumented