class documentation

class EncodingDetector: (source)

Suggests a number of possible encodings for a bytestring.

Order of precedence:

1. Encodings you specifically tell EncodingDetector to try first (the known_definite_encodings argument to the constructor).
2. An encoding determined by sniffing the document's byte-order mark.
3. Encodings you specifically tell EncodingDetector to try if byte-order mark sniffing fails (the user_encodings argument to the constructor).
4. An encoding declared within the bytestring itself, either in an XML declaration (if the bytestring is to be interpreted as an XML document), or in a <meta> tag (if the bytestring is to be interpreted as an HTML document).
5. An encoding detected through textual analysis by chardet, cchardet, or a similar external library.
6. UTF-8.
7. Windows-1252.
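The order of precedence above can be sketched as a small generator. This is a simplified illustration, not the library's implementation: `candidate_encodings` and its parameter names are hypothetical, and the sniffed, declared, and detected encodings are assumed to have been computed already (or to be None).

```python
from typing import Iterable, Iterator, Optional


def candidate_encodings(
    known_definite: Iterable[str] = (),
    sniffed: Optional[str] = None,
    user: Iterable[str] = (),
    declared: Optional[str] = None,
    detected: Optional[str] = None,
) -> Iterator[str]:
    """Yield candidate encodings in the documented order of precedence.

    Hypothetical helper: each argument stands in for one step of the list.
    Missing (None) candidates and case-insensitive duplicates are skipped.
    """
    ordered = [
        *known_definite,   # 1. caller-supplied encodings to try first
        sniffed,           # 2. implied by a byte-order mark
        *user,             # 3. caller-supplied fallbacks
        declared,          # 4. XML declaration or <meta> tag
        detected,          # 5. textual analysis (e.g. chardet)
        "utf-8",           # 6. first default
        "windows-1252",    # 7. last resort
    ]
    seen = set()
    for name in ordered:
        if name is None or name.lower() in seen:
            continue
        seen.add(name.lower())
        yield name


result = list(
    candidate_encodings(known_definite=["ascii"], sniffed="utf-8", declared="ISO-8859-1")
)
print(result)  # ['ascii', 'utf-8', 'ISO-8859-1', 'windows-1252']
```

Because the candidates are produced lazily, a caller can stop at the first encoding that decodes the markup without paying for the later, more expensive steps.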

Kind               Name                      Summary
Class Method       find_declared_encoding    Given a document, tries to find its declared encoding.
Class Method       strip_byte_order_mark     If a byte-order mark is present, strip it and return the encoding it implies.
Method             __init__                  Constructor.
Instance Variable  chardet_encoding          Undocumented
Instance Variable  declared_encoding         Undocumented
Instance Variable  exclude_encodings         Undocumented
Instance Variable  is_html                   Undocumented
Instance Variable  known_definite_encodings  Undocumented
Instance Variable  markup                    Undocumented
Instance Variable  sniffed_encoding          Undocumented
Instance Variable  user_encodings            Undocumented
Property           encodings                 Yield a number of encodings that might work for this markup.
Method             _usable                   Should we even bother to try this encoding?

@classmethod
def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False): (source)

Given a document, tries to find its declared encoding.

An XML encoding is declared at the beginning of the document. An HTML encoding is declared in a <meta> tag, hopefully near the beginning of the document.

:param markup: Some markup.
:param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.
:param search_entire_document: Since an encoding is supposed to be declared near the beginning of the document, most of the time it's only necessary to search a few kilobytes of data. Set this to True to force this method to search the entire document.
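As a rough illustration of the idea, the following stdlib-only sketch looks for an XML declaration first and, for HTML, falls back to a <meta> tag. It is not the library's code: `sketch_declared_encoding`, the two regular expressions, and the 2048-byte search window are all assumptions chosen for the example.

```python
import re
from typing import Optional

# Simplified patterns (assumptions); real-world declarations vary more widely.
XML_DECL = re.compile(rb"^\s*<\?xml[^>]*?encoding\s*=\s*[\"']([^\"']+)[\"']", re.I)
META_CHARSET = re.compile(rb"<\s*meta[^>]+charset\s*=\s*[\"']?([^\s\"'/>;]+)", re.I)


def sketch_declared_encoding(
    markup: bytes, is_html: bool = False, search_entire_document: bool = False
) -> Optional[str]:
    """Return the lowercased declared encoding, or None if none is found."""
    # Assumption: a 2048-byte window stands in for "a few kilobytes".
    window = markup if search_entire_document else markup[:2048]
    match = XML_DECL.search(window)
    if match is None and is_html:
        # Only HTML documents are checked for a <meta> declaration.
        match = META_CHARSET.search(window)
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace").lower()


xml_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>'
html_doc = b'<html><head><meta charset="utf-8"></head></html>'
print(sketch_declared_encoding(xml_doc))                 # iso-8859-1
print(sketch_declared_encoding(html_doc, is_html=True))  # utf-8
```

A declaration buried beyond the window is only found when `search_entire_document=True`, which is the trade-off the parameter describes.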

@classmethod
def strip_byte_order_mark(cls, data): (source)

If a byte-order mark is present, strip it and return the encoding it implies.

:param data: Some markup.
:return: A 2-tuple (modified data, implied encoding)
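A minimal sketch of byte-order mark stripping, using the BOM constants from Python's standard `codecs` module. The function name `sketch_strip_bom` and the encoding labels it returns are assumptions for illustration; the library's own method may differ in detail.

```python
import codecs
from typing import Optional, Tuple

# Longer marks are checked first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
)


def sketch_strip_bom(data: bytes) -> Tuple[bytes, Optional[str]]:
    """Return (data without its BOM, implied encoding or None)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], encoding
    return data, None


print(sketch_strip_bom(b"\xef\xbb\xbf<p>hi</p>"))  # (b'<p>hi</p>', 'utf-8')
print(sketch_strip_bom(b"<p>hi</p>"))              # (b'<p>hi</p>', None)
```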

def __init__(self, markup, known_definite_encodings=None, is_html=False, exclude_encodings=None, user_encodings=None, override_encodings=None): (source)

Constructor.

:param markup: Some markup in an unknown encoding.
:param known_definite_encodings: When determining the encoding of `markup`, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined here: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
:param user_encodings: These encodings will be tried after the `known_definite_encodings` have been tried and failed, and after an attempt to sniff the encoding by looking at a byte order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
:param override_encodings: A deprecated alias for known_definite_encodings. Any encodings here will be tried immediately after the encodings in known_definite_encodings.
:param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.
:param exclude_encodings: These encodings will not be tried, even if they otherwise would be.

chardet_encoding = (source)

Undocumented

declared_encoding = (source)

Undocumented

exclude_encodings = (source)

Undocumented

is_html = (source)

Undocumented

known_definite_encodings = (source)

Undocumented

markup = (source)

Undocumented

sniffed_encoding = (source)

Undocumented

user_encodings = (source)

Undocumented

@property
def encodings(self): (source)

Yield a number of encodings that might work for this markup.

:yield: A sequence of strings.

def _usable(self, encoding, tried): (source)

Should we even bother to try this encoding? :param encoding: Name of an encoding. :param tried: Encodings that have already been tried. This will be modified as a side effect.
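The check described here can be sketched as a standalone function. This is an assumption-laden illustration rather than the method itself: `sketch_usable` is a hypothetical name, and the excluded encodings are passed in explicitly as a set of lowercase names instead of being read from the instance.

```python
from typing import Optional, Set


def sketch_usable(encoding: Optional[str], tried: Set[str], excluded: Set[str]) -> bool:
    """Return True if `encoding` is worth trying; record it in `tried` if so."""
    if encoding is None:
        return False
    name = encoding.lower()  # compare names case-insensitively
    if name in excluded or name in tried:
        return False
    tried.add(name)  # side effect: the caller's set is updated
    return True


tried: Set[str] = set()
excluded = {"latin-1"}
print(sketch_usable("UTF-8", tried, excluded))    # True
print(sketch_usable("utf-8", tried, excluded))    # False (already tried)
print(sketch_usable("Latin-1", tried, excluded))  # False (excluded)
```

Mutating `tried` in place lets one set be shared across every step of the precedence order, so an encoding is never offered twice.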