A data structure representing a parsed HTML or XML document.

Most of the methods you'll call on a BeautifulSoup object are inherited from PageElement or Tag.

Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers. To write a new tree builder, you'll need to understand these methods as a whole.

These methods will be called by the BeautifulSoup constructor:
* reset()
* feed(markup)

The tree builder may call these methods from its feed() implementation:
* handle_starttag(name, attrs) # See note about return value
* handle_endtag(name)
* handle_data(data) # Appends to the current data node
* endData(containerClass) # Ends the current data node

No matter how complicated the underlying parser is, you should be able to build a tree using 'start tag' events, 'end tag' events, 'data' events, and 'done with data' events.

If you encounter an empty-element tag (aka a self-closing tag, like HTML's <br> tag), call handle_starttag and then handle_endtag.
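The event model described above can be sketched with nothing but the standard library. This is a toy illustration, not Beautiful Soup's implementation: `Node`, `ToyTreeBuilder`, `reset_tree` and `end_data` are hypothetical names, and the small `VOID` set is an assumption standing in for a real list of empty-element tags.

```python
from html.parser import HTMLParser


class Node:
    """A minimal tree node: a tag name plus child Nodes and strings."""

    def __init__(self, name):
        self.name = name
        self.children = []


class ToyTreeBuilder(HTMLParser):
    # Assumption: a tiny stand-in for the real set of empty-element tags.
    VOID = {"br", "hr", "img"}

    def __init__(self):
        super().__init__()
        self.reset_tree()

    def reset_tree(self):
        # Start over as though no markup had been parsed.
        self.root = Node("[document]")
        self.stack = [self.root]
        self.pending = []

    def end_data(self):
        # "Done with data": flush buffered text into the current open tag.
        if self.pending:
            self.stack[-1].children.append("".join(self.pending))
            self.pending = []

    def handle_starttag(self, tag, attrs):
        self.end_data()
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        self.end_data()
        # Close everything up to and including the most recent matching tag.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Append to the current data node.
        self.pending.append(data)


builder = ToyTreeBuilder()
builder.feed("<p>Hello<br>world</p>")
builder.close()
builder.end_data()

paragraph = builder.root.children[0]
shape = [c if isinstance(c, str) else c.name for c in paragraph.children]
```

Four kinds of event (start tag, end tag, data, done with data) are enough to rebuild the nesting, which is the point the class description makes.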
Method | __copy__ |
Copy a BeautifulSoup object by converting the document to a string and parsing it again. |
Method | __getstate__ |
Undocumented |
Method | __init__ |
Constructor. |
Method | decode |
Returns a string or Unicode representation of the parse tree as an HTML or XML document. |
Method | end |
Method called by the TreeBuilder when the end of a data segment occurs. |
Method | handle |
Called by the tree builder when a chunk of textual data is encountered. |
Method | handle |
Called by the tree builder when an ending tag is encountered. |
Method | handle |
Called by the tree builder when a new tag is encountered. |
Method | insert |
This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree. |
Method | insert |
This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree. |
Method | new |
Create a new NavigableString associated with this BeautifulSoup object. |
Method | new |
Create a new Tag associated with this BeautifulSoup object. |
Method | object |
Method called by the TreeBuilder to integrate an object into the parse tree. |
Method | pop |
Internal method called by _popToTag when a tag is closed. |
Method | push |
Internal method called by handle_starttag when a tag is opened. |
Method | reset |
Reset this object to a state as though it had never parsed any markup. |
Method | string |
Undocumented |
Constant | ASCII |
Undocumented |
Constant | DEFAULT |
Undocumented |
Constant | NO |
Undocumented |
Constant | ROOT |
Undocumented |
Instance Variable | builder |
Undocumented |
Instance Variable | current |
Undocumented |
Instance Variable | current |
Undocumented |
Instance Variable | element |
Undocumented |
Instance Variable | hidden |
Undocumented |
Instance Variable | is |
Undocumented |
Instance Variable | known |
Undocumented |
Instance Variable | markup |
Undocumented |
Instance Variable | open |
Undocumented |
Instance Variable | parse |
Undocumented |
Instance Variable | preserve |
Undocumented |
Instance Variable | string |
Undocumented |
Instance Variable | tag |
Undocumented |
Class Method | _decode |
Ensure `markup` is bytes so it's safe to send into warnings.warn. |
Class Method | _markup |
Error-handling method to raise a warning if incoming markup looks like a URL. |
Class Method | _markup |
Error-handling method to raise a warning if incoming markup resembles a filename. |
Method | _feed |
Internal method that parses previously set markup, creating a large number of Tag and NavigableString objects. |
Method | _linkage |
Make sure linkage of this fragment is sound. |
Method | _pop |
Pops the tag stack up to and including the most recent instance of the given tag. |
Instance Variable | _most |
Undocumented |
Instance Variable | _namespaces |
Undocumented |
Inherited from Tag:
Method | __bool__ |
A tag is truthy even if it has no contents. |
Method | __call__ |
Calling a Tag like a function is the same as calling its find_all() method. E.g., tag('a') returns a list of all the A tags found within this tag. |
Method | __contains__ |
Undocumented |
Method | __delitem__ |
Deleting tag[key] deletes all 'key' attributes for the tag. |
Method | __eq__ |
Returns true iff this Tag has the same name, the same attributes, and the same contents (recursively) as `other`. |
Method | __getattr__ |
Calling tag.subtag is the same as calling tag.find(name="subtag") |
Method | __getitem__ |
tag[key] returns the value of the 'key' attribute for the Tag, and throws an exception if it's not there. |
Method | __hash__ |
Undocumented |
Method | __iter__ |
Iterating over a Tag iterates over its contents. |
Method | __len__ |
The length of a Tag is the length of its list of contents. |
Method | __ne__ |
Returns true iff this Tag is not identical to `other`, as defined in __eq__. |
Method | __repr__ |
Renders this PageElement as a string. |
Method | __setitem__ |
Setting tag[key] sets the value of the 'key' attribute for the tag. |
Method | __unicode__ |
Renders this PageElement as a Unicode string. |
Method | child |
Deprecated generator. |
Method | clear |
Wipe out all children of this PageElement by calling extract() on them. |
Method | decode |
Renders the contents of this tag as a Unicode string. |
Method | decompose |
Recursively destroys this PageElement and its children. |
Method | encode |
Render a bytestring representation of this PageElement and its contents. |
Method | encode |
Renders the contents of this PageElement as a bytestring. |
Method | find |
Look in the children of this PageElement and find the first PageElement that matches the given criteria. |
Method | find |
Look in the children of this PageElement and find all PageElements that match the given criteria. |
Method | get |
Returns the value of the 'key' attribute for the tag, or the value given for 'default' if it doesn't have that attribute. |
Method | get |
The same as get(), but always returns a list. |
Method | has |
Does this PageElement have an attribute with the given name? |
Method | has |
Deprecated method. This was misleading because has_key() (which checks attributes) behaved differently from the `in` operator (which checks contents). |
Method | index |
Find the index of a child by identity, not value. |
Method | prettify |
Pretty-print this PageElement as a string. |
Method | recursive |
Deprecated generator. |
Method | render |
Deprecated method for BS3 compatibility. |
Method | select |
Perform a CSS selection operation on the current element. |
Method | select |
Perform a CSS selection operation on the current element. |
Method | smooth |
Smooth out this element's children by consolidating consecutive strings. |
Method | string |
Replace this PageElement's contents with `string`. |
Constant | DEFAULT |
Undocumented |
Class Variable | parser |
Undocumented |
Class Variable | strings |
Undocumented |
Instance Variable | attrs |
Undocumented |
Instance Variable | can |
Undocumented |
Instance Variable | cdata |
Undocumented |
Instance Variable | contents |
Undocumented |
Instance Variable | interesting |
Undocumented |
Instance Variable | name |
Undocumented |
Instance Variable | namespace |
Undocumented |
Instance Variable | parser |
Undocumented |
Instance Variable | prefix |
Undocumented |
Instance Variable | preserve |
Undocumented |
Instance Variable | sourceline |
Undocumented |
Instance Variable | sourcepos |
Undocumented |
Property | children |
Iterate over all direct children of this PageElement. |
Property | descendants |
Iterate over all descendants of this PageElement in document order (depth-first). |
Property | is |
Is this tag an empty-element tag? (aka a self-closing tag) |
Property | string |
Convenience property to get the single string within this PageElement. |
Method | _all |
Yield all strings of certain classes, possibly stripping them. |
Method | _should |
Should this tag be pretty-printed? |
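A short sketch of how several of the special methods inherited from Tag behave in practice. The markup and variable names are made up for illustration, and the example assumes the `html.parser` builder.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="box"><a href="/x">x</a><a href="/y">y</a></div>', "html.parser"
)

div = soup.div                     # __getattr__: same as soup.find("div")
links = div("a")                   # __call__: same as div.find_all("a")
first_href = links[0]["href"]      # __getitem__: raises KeyError if missing
missing = div.get("title", "n/a")  # get(): falls back to the default
child_names = [child.name for child in div]  # __iter__: iterates over contents
count = len(div)                   # __len__: number of direct children

del div["id"]                      # __delitem__: removes the attribute
has_id = div.has_attr("id")
```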
Inherited from PageElement (via Tag):
Method | append |
Appends the given PageElement to the contents of this one. |
Method | extend |
Appends the given PageElements to this one's contents. |
Method | extract |
Destructively rips this element out of the tree. |
Method | find |
Find all PageElements that match the given criteria and appear later in the document than this PageElement. |
Method | find |
Look backwards in the document from this PageElement and find all PageElements that match the given criteria. |
Method | find |
Find the first PageElement that matches the given criteria and appears later in the document than this PageElement. |
Method | find |
Find the closest sibling to this PageElement that matches the given criteria and appears later in the document. |
Method | find |
Find all siblings of this PageElement that match the given criteria and appear later in the document. |
Method | find |
Find the closest parent of this PageElement that matches the given criteria. |
Method | find |
Find all parents of this PageElement that match the given criteria. |
Method | find |
Look backwards in the document from this PageElement and find the first PageElement that matches the given criteria. |
Method | find |
Returns the closest sibling to this PageElement that matches the given criteria and appears earlier in the document. |
Method | find |
Returns all siblings to this PageElement that match the given criteria and appear earlier in the document. |
Method | format |
Format the given string using the given formatter. |
Method | formatter |
Look up or create a Formatter for the given identifier, if necessary. |
Method | get |
Get all child strings of this PageElement, concatenated using the given separator. |
Method | insert |
Insert a new PageElement in the list of this PageElement's children. |
Method | next |
Undocumented |
Method | next |
Undocumented |
Method | parent |
Undocumented |
Method | previous |
Undocumented |
Method | previous |
Undocumented |
Method | replace |
Replace this PageElement with one or more PageElements, keeping the rest of the tree the same. |
Method | setup |
Sets up the initial relations between this element and other elements. |
Method | unwrap |
Replace this PageElement with its contents. |
Method | wrap |
Wrap this PageElement inside another one. |
Class Variable | default |
Undocumented |
Class Variable | next |
Undocumented |
Class Variable | previous |
Undocumented |
Class Variable | text |
Undocumented |
Instance Variable | next |
Undocumented |
Instance Variable | next |
Undocumented |
Instance Variable | parent |
Undocumented |
Instance Variable | previous |
Undocumented |
Instance Variable | previous |
Undocumented |
Property | decomposed |
Check whether a PageElement has been decomposed. |
Property | next |
The PageElement, if any, that was parsed just after this one. |
Property | next |
All PageElements that were parsed after this one. |
Property | next |
All PageElements that are siblings of this one but were parsed later. |
Property | parents |
All PageElements that are parents of this PageElement. |
Property | previous |
The PageElement, if any, that was parsed just before this one. |
Property | previous |
All PageElements that were parsed before this one. |
Property | previous |
All PageElements that are siblings of this one but were parsed earlier. |
Property | stripped |
Yield all strings in this PageElement, stripping them first. |
Method | _find |
Iterates over a generator looking for things that match. |
Method | _find |
Undocumented |
Method | _last |
Finds the last element beneath this object to be parsed. |
Property | _is |
Is this element part of an XML tree or an HTML tree? |
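The navigation methods inherited from PageElement can be tried on a small list. The following is a sketch with made-up markup; it uses the public `find_next_sibling`, `find_next_siblings`, `find_previous_siblings`, `find_parent` and `get_text` methods, and assumes the `html.parser` builder.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser"
)
first = soup.find("li")
last = soup.find_all("li")[-1]

# Closest later sibling, then all later siblings.
next_one = first.find_next_sibling("li").get_text()
later = [li.get_text() for li in first.find_next_siblings("li")]

# Earlier siblings are returned closest-first.
earlier = [li.get_text() for li in last.find_previous_siblings("li")]

# Closest matching parent.
parent_name = first.find_parent("ul").name

# All child strings, joined with a separator.
joined = soup.get_text(separator=" ")
```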
bs4.element.Tag.__copy__
Copy a BeautifulSoup object by converting the document to a string and parsing it again.
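A minimal sketch of what copying means for callers: the copy is a separate tree, so changing it leaves the original untouched. The markup is made up, and the example assumes the `html.parser` builder.

```python
import copy

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p>", "html.parser")
clone = copy.copy(soup)

# Edit the copy only.
clone.p.string = "changed"

original_text = soup.p.get_text()
clone_text = clone.p.get_text()
```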
bs4.element.Tag.__init__
bs4.BeautifulStoneSoup
Constructor.
:param markup: A string or a file-like object representing markup to be parsed.
:param features: Desirable features of the parser to be used. This may be the name of a specific parser ("lxml", "lxml-xml", "html.parser", or "html5lib") or it may be the type of markup to be used ("html", "html5", "xml"). It's recommended that you name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments.
:param builder: A TreeBuilder subclass to instantiate (or instance to use) instead of looking one up based on `features`. You only need to use this if you've implemented a custom TreeBuilder.
:param parse_only: A SoupStrainer. Only parts of the document matching the SoupStrainer will be considered. This is useful when parsing part of a document that would otherwise be too large to fit into memory.
:param from_encoding: A string indicating the encoding of the document to be parsed. Pass this in if Beautiful Soup is guessing wrongly about the document's encoding.
:param exclude_encodings: A list of strings indicating encodings known to be wrong. Pass this in if you don't know the document's encoding but you know Beautiful Soup's guess is wrong.
:param element_classes: A dictionary mapping BeautifulSoup classes like Tag and NavigableString to other classes you'd like to be instantiated instead as the parse tree is built. This is useful for subclassing Tag or NavigableString to modify default behavior.
:param kwargs: For backwards compatibility purposes, the constructor accepts certain keyword arguments used in Beautiful Soup 3. None of these arguments do anything in Beautiful Soup 4; they will result in a warning and then be ignored. Apart from this, any keyword arguments passed into the BeautifulSoup constructor are propagated to the TreeBuilder constructor. This makes it possible to configure a TreeBuilder by passing in arguments, not just by saying which one to use.
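As a sketch of two of these parameters, the snippet below names a specific parser through `features` and restricts parsing with `parse_only`. The markup is invented for illustration; `SoupStrainer("a")` is one possible filter, chosen here as an assumption about what a caller might want.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = (
    "<html><body><p>Intro</p>"
    '<a href="/one">One</a><a href="/two">Two</a>'
    "</body></html>"
)

# Name a specific parser so results are consistent across environments.
full = BeautifulSoup(html, "html.parser")

# Only parts of the document matching the strainer are kept.
only_links = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))

hrefs = [a["href"] for a in only_links.find_all("a")]
paragraph_in_full = full.find("p")
paragraph_in_filtered = only_links.find("p")
```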
bs4.element.Tag.decode
Returns a string or Unicode representation of the parse tree as an HTML or XML document.
:param pretty_print: If this is True, indentation will be used to make the document more readable.
:param eventual_encoding: The encoding of the final document. If this is None, the document will be a Unicode string.
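A small sketch of rendering a parsed document back to markup. It calls `decode()` with no arguments and, for comparison, the related `encode()` and `prettify()` methods inherited from Tag; the sample markup is made up and the `html.parser` builder is assumed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Caf\u00e9</p>", "html.parser")

as_text = soup.decode()          # a str holding the whole document
as_bytes = soup.encode("utf-8")  # the same document as a bytestring
pretty = soup.prettify()         # indented for readability
```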
Called by the tree builder when an ending tag is encountered. :param name: Name of the tag. :param nsprefix: Namespace prefix for the tag.
Called by the tree builder when a new tag is encountered.
:param name: Name of the tag.
:param nsprefix: Namespace prefix for the tag.
:param attrs: A dictionary of attribute values.
:param sourceline: The line number where this tag was found in its source document.
:param sourcepos: The character position within `sourceline` where this tag was found.
:param namespaces: A dictionary of all namespace prefix mappings currently in scope in the document.
If this method returns None, the tag was rejected by an active SoupStrainer. You should proceed as if the tag had not occurred in the document. For instance, if this was a self-closing tag, don't call handle_endtag.
bs4.element.PageElement.insert_after
This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree.
bs4.element.PageElement.insert_before
This method is part of the PageElement API, but `BeautifulSoup` doesn't implement it because there is nothing before or after it in the parse tree.
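To illustrate, the sketch below shows the call being refused on the top-level object, followed by one workable alternative: appending inside the document instead. The markup is made up and the `html.parser` builder is assumed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p>", "html.parser")
new_p = soup.new_tag("p")
new_p.string = "two"

refused = False
try:
    # Nothing exists before or after the document itself.
    soup.insert_after(new_p)
except NotImplementedError:
    refused = True

# Adding a child to the document works as usual.
soup.append(new_p)
result = soup.decode()
```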
Create a new Tag associated with this BeautifulSoup object.
:param name: The name of the new Tag.
:param namespace: The URI of the new Tag's XML namespace, if any.
:param prefix: The prefix for the new Tag's XML namespace, if any.
:param attrs: A dictionary of this Tag's attribute values; can be used instead of `kwattrs` for attributes like 'class' that are reserved words in Python.
:param sourceline: The line number where this tag was (purportedly) found in its source document.
:param sourcepos: The character position within `sourceline` where this tag was (purportedly) found.
:param kwattrs: Keyword arguments for the new Tag's attribute values.
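A brief sketch of the two ways to supply attributes: keyword arguments for ordinary names, and the `attrs` dictionary for names such as 'class' that are reserved words in Python. The URL and attribute values are placeholders.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p></p>", "html.parser")

# href goes in as a keyword argument; 'class' has to go through attrs.
link = soup.new_tag("a", href="https://example.com/", attrs={"class": "external"})
link.string = "Example"

# A new tag is not part of the tree until it is attached somewhere.
soup.p.append(link)

found = soup.find("a")
```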
Undocumented
Ensure `markup` is bytes so it's safe to send into warnings.warn. TODO: warnings.warn had this problem back in 2010 but it might not anymore.
Error-handling method to raise a warning if incoming markup looks like a URL. :param markup: A string. :return: Whether or not the markup resembles a URL closely enough to justify a warning.
Error-handling method to raise a warning if incoming markup resembles a filename. :param markup: A bytestring or string. :return: Whether or not the markup resembles a filename closely enough to justify a warning.
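These checks are internal, but their effect is visible from the constructor. The sketch below passes a URL-like string as markup and records any warnings; it filters on the built-in `UserWarning` category as a conservative assumption rather than naming a specific warning class.

```python
import warnings

from bs4 import BeautifulSoup

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # A URL is not markup; the constructor is expected to warn about it.
    BeautifulSoup("https://example.com/page", "html.parser")

messages = [
    str(w.message) for w in caught if issubclass(w.category, UserWarning)
]
```

If the intent was to parse the page at that URL, the content has to be fetched first and the response body passed in as markup.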
Internal method that parses previously set markup, creating a large number of Tag and NavigableString objects.
Pops the tag stack up to and including the most recent instance of the given tag. If there are no open tags with the given name, nothing will be popped.
:param name: Pop up to the most recent tag with this name.
:param nsprefix: The namespace prefix that goes with `name`.
:param inclusivePop: If this is false, pops the tag stack up to but *not* including the most recent instance of the given tag.
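The popping rule can be sketched on a plain list of tag names. This is a simplified stand-in, not the library's code: `pop_to_tag` and `inclusive_pop` are hypothetical names, namespace prefixes are ignored, and the root entry at index 0 is assumed never to be popped.

```python
def pop_to_tag(stack, name, inclusive_pop=True):
    """Pop names off `stack` up to the most recent `name`.

    Returns the last name popped, or None if nothing was popped.
    """
    if name not in stack[1:]:
        # No open tag with this name: leave the stack alone.
        return None
    popped = None
    while len(stack) > 1:
        if stack[-1] == name:
            if inclusive_pop:
                popped = stack.pop()
            break
        popped = stack.pop()
    return popped


inclusive = ["[document]", "html", "body", "p", "b"]
inclusive_result = pop_to_tag(inclusive, "p")

exclusive = ["[document]", "html", "body", "p", "b"]
exclusive_result = pop_to_tag(exclusive, "p", inclusive_pop=False)

untouched = ["[document]", "html", "body", "p", "b"]
untouched_result = pop_to_tag(untouched, "table")
```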