class UnicodeDammit:
A class for detecting the encoding of a *ML document and converting it to a Unicode string. If the source encoding is windows-1252, it can also replace Microsoft smart quotes with their HTML or XML equivalents.
Class Method | detwingle |
Fix characters from one encoding embedded in some other encoding. |
Method | __init__ |
Constructor. |
Method | find |
Convert the name of a character set to a codec name. |
Constant | CHARSET |
Undocumented |
Constant | ENCODINGS |
Undocumented |
Constant | FIRST |
Undocumented |
Constant | LAST |
Undocumented |
Constant | MS |
Undocumented |
Constant | MS |
Undocumented |
Constant | MULTIBYTE |
Undocumented |
Constant | WINDOWS |
Undocumented |
Instance Variable | contains |
Undocumented |
Instance Variable | detector |
Undocumented |
Instance Variable | is |
Undocumented |
Instance Variable | log |
Undocumented |
Instance Variable | markup |
Undocumented |
Instance Variable | original |
Undocumented |
Instance Variable | smart |
Undocumented |
Instance Variable | tried |
Undocumented |
Instance Variable | unicode |
Undocumented |
Property | declared |
If the markup is an HTML document, returns the encoding declared _within_ the document. |
Method | _codec |
Undocumented |
Method | _convert |
Attempt to convert the markup to the proposed encoding. |
Method | _sub |
Changes a MS smart quote character to an XML or HTML entity, or an ASCII character. |
Method | _to |
Given a string and its encoding, decodes the string into Unicode. |
def detwingle(cls, in_bytes, main_encoding='utf8', embedded_encoding='windows-1252'):
Fix characters from one encoding embedded in some other encoding.

Currently the only situation supported is Windows-1252 (or its subset ISO-8859-1), embedded in UTF-8.

:param in_bytes: A bytestring that you suspect contains characters from multiple encodings. Note that this _must_ be a bytestring. If you've already converted the document to Unicode, you're too late.
:param main_encoding: The primary encoding of `in_bytes`.
:param embedded_encoding: The encoding that was used to embed characters in the main document.
:return: A bytestring in which `embedded_encoding` characters have been converted to their `main_encoding` equivalents.
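As a rough illustration of the situation described above, the following sketch builds a bytestring that mixes UTF-8 with Windows-1252 smart quotes and passes it through `detwingle`. It assumes Beautiful Soup 4 is installed and importable as `bs4`; the sample text and variable names are made up for the example.

```python
from bs4.dammit import UnicodeDammit

# Three snowmen encoded as UTF-8, followed by a quotation whose
# curly quotes were encoded as Windows-1252 (illustrative data).
snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
mixed = snowmen.encode("utf8") + quote.encode("windows-1252")

# The mixed bytestring is not valid UTF-8, so a strict decode fails.
try:
    mixed.decode("utf8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# detwingle() takes bytes and returns bytes that are pure UTF-8.
fixed = UnicodeDammit.detwingle(mixed)
text = fixed.decode("utf8")
print(text)
```

Note that `detwingle` is called on the raw bytes before any decoding, which matches the warning in the `in_bytes` description.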
Constructor.

:param markup: A bytestring representing markup in an unknown encoding.
:param known_definite_encodings: When determining the encoding of `markup`, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined here: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
:param user_encodings: These encodings will be tried after the `known_definite_encodings` have been tried and failed, and after an attempt to sniff the encoding by looking at a byte order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
:param override_encodings: A deprecated alias for `known_definite_encodings`. Any encodings here will be tried immediately after the encodings in `known_definite_encodings`.
:param smart_quotes_to: By default, Microsoft smart quotes will, like all other characters, be converted to Unicode characters. Setting this to 'ascii' will convert them to ASCII quotes instead. Setting it to 'xml' will convert them to XML entity references, and setting it to 'html' will convert them to HTML entity references.
:param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.
:param exclude_encodings: These encodings will not be considered, even if the sniffing code thinks they might make sense.
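The constructor parameters above can be exercised with a short sketch like the one below. It assumes Beautiful Soup 4.10 or later (where `known_definite_encodings` is accepted as a keyword) and reads the decoded result from the `unicode_markup` attribute; the sample markup is invented for illustration.

```python
from bs4.dammit import UnicodeDammit

# Markup containing Windows-1252 smart quotes (bytes 0x93, 0x94, 0x92).
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# Default behaviour: smart quotes become ordinary Unicode characters.
as_unicode = UnicodeDammit(
    markup, known_definite_encodings=["windows-1252"]
).unicode_markup

# smart_quotes_to="html": smart quotes become HTML entity references.
as_html = UnicodeDammit(
    markup, known_definite_encodings=["windows-1252"], smart_quotes_to="html"
).unicode_markup

# smart_quotes_to="ascii": smart quotes become plain ASCII quotes.
as_ascii = UnicodeDammit(
    markup, known_definite_encodings=["windows-1252"], smart_quotes_to="ascii"
).unicode_markup

print(as_unicode)
print(as_html)
print(as_ascii)
```

Supplying the encoding up front keeps the example deterministic; without it, the result depends on the sniffing steps and on which optional detection libraries are installed.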
Convert the name of a character set to a codec name.

:param charset: The name of a character set.
:return: The name of a codec.
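The class's own lookup logic is not shown on this page. As an analogous illustration only, the sketch below maps character-set names to Python codec names using the standard library's `codecs.lookup`. The helper `to_codec_name` is hypothetical and is not part of `UnicodeDammit`.

```python
import codecs
from typing import Optional


def to_codec_name(charset: str) -> Optional[str]:
    """Hypothetical helper: return Python's canonical codec name, or None."""
    try:
        return codecs.lookup(charset).name
    except LookupError:
        # Python has no codec registered under this name.
        return None


print(to_codec_name("UTF8"))
print(to_codec_name("latin-1"))
print(to_codec_name("windows-1252"))
print(to_codec_name("no-such-charset"))
```

The names returned here are the standard library's canonical names, which may differ from what the method documented above returns.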
Attempt to convert the markup to the proposed encoding.

:param proposed: The name of a character encoding.