class EncodingDetector:
Suggests a number of possible encodings for a bytestring. Order of precedence:
1. Encodings you specifically tell EncodingDetector to try first (the known_definite_encodings argument to the constructor).
2. An encoding determined by sniffing the document's byte-order mark.
3. Encodings you specifically tell EncodingDetector to try if byte-order mark sniffing fails (the user_encodings argument to the constructor).
4. An encoding declared within the bytestring itself, either in an XML declaration (if the bytestring is to be interpreted as an XML document), or in a <meta> tag (if the bytestring is to be interpreted as an HTML document).
5. An encoding detected through textual analysis by chardet, cchardet, or a similar external library.
6. UTF-8.
7. Windows-1252.
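The precedence order above can be illustrated with a small standalone generator. This is only a sketch under stated assumptions, not the library's implementation: the function name `candidate_encodings` and all of its parameters are hypothetical, and the results of the sniffing, declaration, and detection steps are simply passed in as optional values.

```python
from typing import Iterable, Iterator, Optional


def candidate_encodings(
    known_definite: Iterable[str] = (),
    sniffed: Optional[str] = None,
    user: Iterable[str] = (),
    declared: Optional[str] = None,
    detected: Optional[str] = None,
    exclude: Iterable[str] = (),
) -> Iterator[str]:
    """Yield encodings in the documented order of precedence.

    Hypothetical helper: each argument stands in for the result of one
    step in the list above. Steps that produced nothing are passed as None.
    """
    excluded = {name.lower() for name in exclude}
    tried = set()
    ordered = [
        *known_definite,   # 1. caller-supplied definite encodings
        sniffed,           # 2. byte-order mark
        *user,             # 3. caller-supplied fallbacks
        declared,          # 4. XML declaration or <meta> tag
        detected,          # 5. external detection library
        "utf-8",           # 6. first last-resort default
        "windows-1252",    # 7. second last-resort default
    ]
    for encoding in ordered:
        if encoding is None:
            continue
        key = encoding.lower()
        # Skip excluded encodings and anything already yielded.
        if key in excluded or key in tried:
            continue
        tried.add(key)
        yield encoding
```

Skipping duplicates means a declared encoding of UTF-8 is not offered a second time when the defaults are reached.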
Class Method | find_declared_encoding |
Given a document, tries to find its declared encoding. |
Class Method | strip |
If a byte-order mark is present, strip it and return the encoding it implies. |
Method | __init__ |
Constructor. |
Instance Variable | chardet |
Undocumented |
Instance Variable | declared |
Undocumented |
Instance Variable | exclude |
Undocumented |
Instance Variable | is |
Undocumented |
Instance Variable | known |
Undocumented |
Instance Variable | markup |
Undocumented |
Instance Variable | sniffed |
Undocumented |
Instance Variable | user |
Undocumented |
Property | encodings |
Yield a number of encodings that might work for this markup. |
Method | _usable |
Should we even bother to try this encoding? |
def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
Given a document, tries to find its declared encoding.
An XML encoding is declared at the beginning of the document. An HTML encoding is declared in a <meta> tag, hopefully near the beginning of the document.
:param markup: Some markup.
:param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.
:param search_entire_document: Since an encoding is supposed to be declared near the beginning of the document, most of the time it's only necessary to search a few kilobytes of data. Set this to True to force this method to search the entire document.
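As a rough illustration of what looking for a declared encoding involves, here is a standalone sketch using only the standard library. It is not the library's code: the function `find_declared`, both regular expressions, and the 2048-byte search window are assumptions chosen for the example (the documentation only says "a few kilobytes").

```python
import re
from typing import Optional

# Simplified, hypothetical patterns; real-world markup may need looser ones.
XML_DECL = re.compile(
    rb"^\s*<\?xml[^>]*?encoding\s*=\s*['\"]([A-Za-z0-9._-]+)['\"]", re.I
)
META_CHARSET = re.compile(
    rb"<\s*meta[^>]+charset\s*=\s*['\"]?([A-Za-z0-9._-]+)", re.I
)


def find_declared(
    markup: bytes,
    is_html: bool = False,
    search_entire_document: bool = False,
) -> Optional[str]:
    """Return the lowercased declared encoding, or None if none is found."""
    # Assumed window size: only the first 2048 bytes are searched by default.
    window = markup if search_entire_document else markup[:2048]
    # An XML declaration must sit at the very start of the document.
    match = XML_DECL.search(window)
    # A <meta> tag is only meaningful when the markup is treated as HTML.
    if match is None and is_html:
        match = META_CHARSET.search(window)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

With this sketch, a `<meta>` tag that appears after the search window is found only when `search_entire_document=True`.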
If a byte-order mark is present, strip it and return the encoding it implies.
:param data: Some markup.
:return: A 2-tuple (modified data, implied encoding)
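The idea of stripping a byte-order mark and returning a 2-tuple can be sketched with the standard library's `codecs` constants. This is an illustrative stand-in rather than the library's method: the name `strip_bom`, the encoding labels, and the lookup table are assumptions made for the example.

```python
import codecs
from typing import Optional, Tuple

# UTF-32 marks are checked first because the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def strip_bom(data: bytes) -> Tuple[bytes, Optional[str]]:
    """Return (data without its byte-order mark, implied encoding).

    If no byte-order mark is present, the data is returned unchanged
    and the implied encoding is None.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], encoding
    return data, None
```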
Constructor.
:param markup: Some markup in an unknown encoding.
:param known_definite_encodings: When determining the encoding of `markup`, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined here: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
:param user_encodings: These encodings will be tried after the `known_definite_encodings` have been tried and failed, and after an attempt to sniff the encoding by looking at a byte-order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
:param override_encodings: A deprecated alias for known_definite_encodings. Any encodings here will be tried immediately after the encodings in known_definite_encodings.
:param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.
:param exclude_encodings: These encodings will not be tried, even if they otherwise would be.
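The phrase "tried first, in order" suggests a consumer that attempts each candidate until one decodes cleanly. The following is a minimal sketch of that pattern, assuming a hypothetical helper named `decode_with_first_match`; it does not describe how the library itself consumes the candidates.

```python
from typing import Iterable, Optional, Tuple


def decode_with_first_match(
    data: bytes, encodings: Iterable[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each encoding in order; return (text, encoding) for the first success.

    Returns (None, None) if every candidate fails.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: Python does not recognize the encoding name.
            continue
    return None, None
```

Because the candidates are tried strictly in order, a permissive single-byte encoding such as Windows-1252 placed early in the list would accept almost any input, which is one reason to keep it as a last resort.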