bs4.dammit.EncodingDetector

class documentation

class EncodingDetector:

Suggests a number of possible encodings for a bytestring.

Order of precedence:

1. Encodings you specifically tell EncodingDetector to try first (the known_definite_encodings argument to the constructor).
2. An encoding determined by sniffing the document's byte-order mark.
3. Encodings you specifically tell EncodingDetector to try if byte-order mark sniffing fails (the user_encodings argument to the constructor).
4. An encoding declared within the bytestring itself, either in an XML declaration (if the bytestring is to be interpreted as an XML document), or in a <meta> tag (if the bytestring is to be interpreted as an HTML document).
5. An encoding detected through textual analysis by chardet, cchardet, or a similar external library.
6. UTF-8.
7. Windows-1252.
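The order of precedence can be illustrated with a small standalone generator. This is a minimal sketch using only the standard library, not the code Beautiful Soup runs: the function name `candidate_encodings` and its parameters are hypothetical, and lowercasing names before comparing them is an assumption made for the example.

```python
def candidate_encodings(known_definite=(), bom_encoding=None, user=(),
                        declared=None, detected=None, exclude=()):
    """Yield candidate encoding names in the documented order of precedence.

    Hypothetical helper for illustration only; not part of bs4.
    """
    excluded = {name.lower() for name in exclude}
    tried = set()
    ordered = [
        *known_definite,   # 1. supplied by the caller, tried first
        bom_encoding,      # 2. implied by a byte-order mark, if any
        *user,             # 3. supplied by the caller as fallbacks
        declared,          # 4. declared inside the document itself
        detected,          # 5. guessed by an external detection library
        "utf-8",           # 6. first default
        "windows-1252",    # 7. last resort
    ]
    for name in ordered:
        if name is None:
            continue  # this step produced no suggestion
        name = name.lower()  # assumption: names compare case-insensitively
        if name in excluded or name in tried:
            continue
        tried.add(name)
        yield name


print(list(candidate_encodings(known_definite=["latin-1"],
                               bom_encoding="utf-8",
                               user=["big5"],
                               declared="UTF-8",
                               exclude=["windows-1252"])))
```

In this sketch a name that appears at several steps is yielded only once, at its earliest position, and excluded names never appear.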

Class Method	`find_declared_encoding`	Given a document, tries to find its declared encoding.
Class Method	`strip_byte_order_mark`	If a byte-order mark is present, strip it and return the encoding it implies.
Method	`__init__`	Constructor.
Instance Variable	`chardet_encoding`	Undocumented
Instance Variable	`declared_encoding`	Undocumented
Instance Variable	`exclude_encodings`	Undocumented
Instance Variable	`is_html`	Undocumented
Instance Variable	`known_definite_encodings`	Undocumented
Instance Variable	`markup`	Undocumented
Instance Variable	`sniffed_encoding`	Undocumented
Instance Variable	`user_encodings`	Undocumented
Property	`encodings`	Yield a number of encodings that might work for this markup.
Method	`_usable`	Should we even bother to try this encoding?

@classmethod
def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):

Given a document, tries to find its declared encoding.

An XML encoding is declared at the beginning of the document. An HTML encoding is declared in a <meta> tag, hopefully near the beginning of the document.

:param markup: Some markup.
:param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.
:param search_entire_document: Since an encoding is supposed to be declared near the beginning of the document, most of the time it's only necessary to search a few kilobytes of data. Set this to True to force this method to search the entire document.
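The idea of looking for a declaration near the start of the document can be sketched with two regular expressions. This is an illustration under stated assumptions, not the library's implementation: the function `sketch_find_declared_encoding`, both patterns, and the 4096-byte `prefix_size` are hypothetical, and the patterns are far less thorough than a real parser's.

```python
import re

# Hypothetical patterns for illustration; bs4's own patterns may differ.
XML_DECLARATION = re.compile(
    rb"^\s*<\?xml\b[^>]*?\bencoding\s*=\s*[\"']([^\"']+)[\"']", re.I)
HTML_META = re.compile(
    rb"<meta\b[^>]*?\bcharset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.I)


def sketch_find_declared_encoding(markup, is_html=False,
                                  search_entire_document=False,
                                  prefix_size=4096):
    """Return the lowercased declared encoding of a bytestring, or None."""
    # Only look at a short prefix unless the caller asks for a full search.
    head = markup if search_entire_document else markup[:prefix_size]
    pattern = HTML_META if is_html else XML_DECLARATION
    match = pattern.search(head)
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace").lower()


xml_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>'
html_doc = b'<html><head><meta charset="Shift_JIS"></head></html>'
print(sketch_find_declared_encoding(xml_doc))
print(sketch_find_declared_encoding(html_doc, is_html=True))
```

A declaration that sits beyond the prefix is found only when `search_entire_document=True`, which is the trade-off the parameter describes.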

@classmethod
def strip_byte_order_mark(cls, data):

If a byte-order mark is present, strip it and return the encoding it implies.

:param data: Some markup.
:return: A 2-tuple (modified data, implied encoding)
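Byte-order mark sniffing can be sketched with the BOM constants from the standard `codecs` module. The helper `sketch_strip_bom` and the encoding names it returns are assumptions made for illustration; the library's own method may treat ambiguous inputs differently.

```python
import codecs

# Longer marks are checked first: the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]


def sketch_strip_bom(data):
    """Return (data without its BOM, implied encoding or None)."""
    if isinstance(data, str):
        # Already-decoded text has no byte-order mark to strip.
        return data, None
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):], encoding
    return data, None


print(sketch_strip_bom(b"\xef\xbb\xbf<p>hi</p>"))
print(sketch_strip_bom(b"<p>hi</p>"))
```

When no mark is present the data comes back unchanged with `None` as the encoding, matching the 2-tuple shape described above.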

def __init__(self, markup, known_definite_encodings=None, is_html=False, exclude_encodings=None, user_encodings=None, override_encodings=None):

Constructor.

:param markup: Some markup in an unknown encoding.
:param known_definite_encodings: When determining the encoding of `markup`, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined here: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
:param user_encodings: These encodings will be tried after the `known_definite_encodings` have been tried and failed, and after an attempt to sniff the encoding by looking at a byte-order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
:param override_encodings: A deprecated alias for known_definite_encodings. Any encodings here will be tried immediately after the encodings in known_definite_encodings.
:param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.
:param exclude_encodings: These encodings will not be tried, even if they otherwise would be.

chardet_encoding

Undocumented

declared_encoding

Undocumented

exclude_encodings

Undocumented

is_html

Undocumented

known_definite_encodings

Undocumented

markup

Undocumented

sniffed_encoding

Undocumented

user_encodings

Undocumented

@property
encodings

Yield a number of encodings that might work for this markup.

:yield: A sequence of strings.

def _usable(self, encoding, tried):

Should we even bother to try this encoding?

:param encoding: Name of an encoding.
:param tried: Encodings that have already been tried. This will be modified as a side effect.
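The side effect on `tried` is easiest to see in a standalone sketch. The function `sketch_usable` and its `excluded` parameter are hypothetical stand-ins for the private method and the instance's exclusion list, and the case-insensitive comparison is an assumption.

```python
def sketch_usable(encoding, tried, excluded=frozenset()):
    """Return True if `encoding` is worth trying, recording it in `tried`."""
    if encoding is None:
        return False
    name = encoding.lower()  # assumption: names compare case-insensitively
    if name in excluded or name in tried:
        return False
    tried.add(name)  # side effect: the caller's set grows
    return True


tried = set()
print(sketch_usable("UTF-8", tried))   # first time this name is seen
print(sketch_usable("utf-8", tried))   # already recorded in `tried`
print(tried)
```

Because the set is mutated in place, a caller can pass the same `tried` object through a whole sequence of checks without tracking duplicates separately.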