class documentation

class UnicodeDammit: (source)

A class for detecting the encoding of a *ML (HTML or XML) document and converting it to a Unicode string. If the source encoding is Windows-1252, it can also replace Microsoft smart quotes with their HTML or XML equivalents.

Kind               Name                             Summary
Class Method       detwingle                        Fix characters from one encoding embedded in some other encoding.
Method             __init__                         Constructor.
Method             find_codec                       Convert the name of a character set to a codec name.
Constant           CHARSET_ALIASES                  Undocumented
Constant           ENCODINGS_WITH_SMART_QUOTES      Undocumented
Constant           FIRST_MULTIBYTE_MARKER           Undocumented
Constant           LAST_MULTIBYTE_MARKER            Undocumented
Constant           MS_CHARS                         Undocumented
Constant           MS_CHARS_TO_ASCII                Undocumented
Constant           MULTIBYTE_MARKERS_AND_SIZES      Undocumented
Constant           WINDOWS_1252_TO_UTF8             Undocumented
Instance Variable  contains_replacement_characters  Undocumented
Instance Variable  detector                         Undocumented
Instance Variable  is_html                          Undocumented
Instance Variable  log                              Undocumented
Instance Variable  markup                           Undocumented
Instance Variable  original_encoding                Undocumented
Instance Variable  smart_quotes_to                  Undocumented
Instance Variable  tried_encodings                  Undocumented
Instance Variable  unicode_markup                   Undocumented
Property           declared_html_encoding           If the markup is an HTML document, returns the encoding declared _within_ the document.
Method             _codec                           Undocumented
Method             _convert_from                    Attempt to convert the markup to the proposed encoding.
Method             _sub_ms_char                     Changes an MS smart quote character to an XML or HTML entity, or an ASCII character.
Method             _to_unicode                      Given a string and its encoding, decodes the string into Unicode.

@classmethod
def detwingle(cls, in_bytes, main_encoding='utf8', embedded_encoding='windows-1252'): (source)

Fix characters from one encoding embedded in some other encoding.

Currently the only situation supported is Windows-1252 (or its subset ISO-8859-1), embedded in UTF-8.

:param in_bytes: A bytestring that you suspect contains characters from multiple encodings. Note that this _must_ be a bytestring. If you've already converted the document to Unicode, you're too late.
:param main_encoding: The primary encoding of `in_bytes`.
:param embedded_encoding: The encoding that was used to embed characters in the main document.
:return: A bytestring in which `embedded_encoding` characters have been converted to their `main_encoding` equivalents.
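
As a rough illustration of the idea, the sketch below walks a bytestring, keeps byte runs that are valid UTF-8 multibyte sequences, and re-encodes any other high byte as if it were Windows-1252. The names `detwingle_sketch` and `_is_valid_utf8` and the validation strategy are assumptions made for this example; it is a simplified stand-in, not the library's implementation.

```python
# Lead-byte ranges and sequence lengths for UTF-8, copied from the
# MULTIBYTE_MARKERS_AND_SIZES value shown on this page.
MARKERS = [(194, 223, 2), (224, 239, 3), (240, 244, 4)]


def _is_valid_utf8(chunk: bytes) -> bool:
    """Hypothetical helper: True if chunk decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def detwingle_sketch(in_bytes: bytes) -> bytes:
    """Hypothetical helper: convert stray Windows-1252 bytes to UTF-8."""
    out = []
    pos = 0
    while pos < len(in_bytes):
        byte = in_bytes[pos]
        # Sequence length implied by this byte if it is a UTF-8 lead byte.
        size = next((n for lo, hi, n in MARKERS if lo <= byte <= hi), 0)
        chunk = in_bytes[pos:pos + size]
        if size and _is_valid_utf8(chunk):
            # A well-formed UTF-8 sequence: copy it through unchanged.
            out.append(chunk)
            pos += size
            continue
        if byte >= 0x80:
            # Treat any other high byte as Windows-1252 and re-encode it.
            char = bytes([byte]).decode("windows-1252", errors="replace")
            out.append(char.encode("utf-8"))
        else:
            out.append(bytes([byte]))
        pos += 1
    return b"".join(out)


# UTF-8 snowman followed by Windows-1252 curly quotes.
doc = "\N{SNOWMAN}".encode("utf-8") + "\u201cHi\u201d".encode("windows-1252")
fixed = detwingle_sketch(doc)
```

With Beautiful Soup installed, the documented entry point is `UnicodeDammit.detwingle(doc)`, which likewise takes and returns a bytestring.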

def __init__(self, markup, known_definite_encodings=[], smart_quotes_to=None, is_html=False, exclude_encodings=[], user_encodings=None, override_encodings=None): (source)

Constructor.

:param markup: A bytestring representing markup in an unknown encoding.
:param known_definite_encodings: When determining the encoding of `markup`, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined here: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
:param user_encodings: These encodings will be tried after the `known_definite_encodings` have been tried and failed, and after an attempt to sniff the encoding by looking at a byte order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
:param override_encodings: A deprecated alias for `known_definite_encodings`. Any encodings here will be tried immediately after the encodings in `known_definite_encodings`.
:param smart_quotes_to: By default, Microsoft smart quotes will, like all other characters, be converted to Unicode characters. Setting this to 'ascii' will convert them to ASCII quotes instead. Setting it to 'xml' will convert them to XML entity references, and setting it to 'html' will convert them to HTML entity references.
:param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.
:param exclude_encodings: These encodings will not be considered, even if the sniffing code thinks they might make sense.
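
The ordering described for `known_definite_encodings`, byte order mark sniffing, `user_encodings`, and `exclude_encodings` can be sketched roughly as follows. The helper names `sniff_bom` and `candidate_encodings` are hypothetical, only UTF-8 and UTF-16 byte order marks are checked, and the later detection steps the class may perform are deliberately left out.

```python
import codecs

# Byte order marks recognised by this sketch (an intentionally short list).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def sniff_bom(markup: bytes):
    """Hypothetical helper: name the encoding implied by a leading BOM."""
    for bom, name in _BOMS:
        if markup.startswith(bom):
            return name
    return None


def candidate_encodings(markup, known_definite_encodings=(),
                        user_encodings=(), exclude_encodings=()):
    """Hypothetical helper: list encodings in the order they would be tried."""
    excluded = {name.lower() for name in exclude_encodings}
    ordered = list(known_definite_encodings)  # tried first, in order
    bom_encoding = sniff_bom(markup)          # then the byte order mark
    if bom_encoding:
        ordered.append(bom_encoding)
    ordered.extend(user_encodings)            # then the user's suggestions
    seen = set()
    result = []
    for name in ordered:
        key = name.lower()
        if key not in excluded and key not in seen:
            seen.add(key)
            result.append(name)
    return result


order = candidate_encodings(b"caf\xe9", ["utf-8"], ["latin-1"])
```

Here `order` lists `utf-8` before `latin-1`; a caller would attempt each in turn and stop at the first that decodes cleanly.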

def find_codec(self, charset): (source)

Convert the name of a character set to a codec name.

:param charset: The name of a character set.
:return: The name of a codec.
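
One plausible way to perform this kind of lookup is to consult an alias table and then ask Python's `codecs` registry whether the name is known. The function `find_codec_sketch` below is a hypothetical stand-in, and the spelling variants it tries are an assumption rather than the method's documented behavior.

```python
import codecs

# Alias table copied from the CHARSET_ALIASES value shown on this page.
CHARSET_ALIASES = {"macintosh": "mac-roman", "x-sjis": "shift-jis"}


def find_codec_sketch(charset):
    """Hypothetical helper: map a charset name to a Python codec name."""
    if not charset:
        return None
    name = CHARSET_ALIASES.get(charset.lower(), charset)
    # Try the name as given, then with hyphens removed or underscored.
    for candidate in (name, name.replace("-", ""), name.replace("-", "_")):
        try:
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return None
```

Unknown names yield `None` instead of raising, which lets a caller skip unusable suggestions and move on to the next candidate encoding.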

CHARSET_ALIASES: dict[str, str] = (source)

Undocumented

Value
{'macintosh': 'mac-roman', 'x-sjis': 'shift-jis'}
ENCODINGS_WITH_SMART_QUOTES: list[str] = (source)

Undocumented

Value
['windows-1252', 'iso-8859-1', 'iso-8859-2']
FIRST_MULTIBYTE_MARKER = (source)

Undocumented

Value
MULTIBYTE_MARKERS_AND_SIZES[0][0]
LAST_MULTIBYTE_MARKER = (source)

Undocumented

Value
MULTIBYTE_MARKERS_AND_SIZES[-1][1]
MS_CHARS: dict = (source)

Undocumented

Value
{b'\x80': ('euro', '20AC'),
 b'\x81': ' ',
 b'\x82': ('sbquo', '201A'),
 b'\x83': ('fnof', '192'),
 b'\x84': ('bdquo', '201E'),
 b'\x85': ('hellip', '2026'),
 b'\x86': ('dagger', '2020'),
...
MS_CHARS_TO_ASCII: dict = (source)

Undocumented

Value
{b'\x80': 'EUR',
 b'\x81': ' ',
 b'\x82': ',',
 b'\x83': 'f',
 b'\x84': ',,',
 b'\x85': '...',
 b'\x86': '+',
...
MULTIBYTE_MARKERS_AND_SIZES: list = (source)

Undocumented

Value
[(194, 223, 2), (224, 239, 3), (240, 244, 4)]
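
Each tuple pairs a range of UTF-8 lead bytes with the length of the sequence that such a byte starts, and `FIRST_MULTIBYTE_MARKER` and `LAST_MULTIBYTE_MARKER` are the outer bounds of those ranges (194 and 244). A small sketch of how the table can be read, using a hypothetical helper `utf8_sequence_length`:

```python
MULTIBYTE_MARKERS_AND_SIZES = [(194, 223, 2), (224, 239, 3), (240, 244, 4)]
FIRST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[0][0]   # 194
LAST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[-1][1]   # 244


def utf8_sequence_length(lead_byte: int):
    """Hypothetical helper: bytes in the sequence started by lead_byte."""
    if not FIRST_MULTIBYTE_MARKER <= lead_byte <= LAST_MULTIBYTE_MARKER:
        return None  # ASCII, a continuation byte, or not valid as a lead byte
    for start, end, size in MULTIBYTE_MARKERS_AND_SIZES:
        if start <= lead_byte <= end:
            return size
    return None


two = utf8_sequence_length("\u00e9".encode("utf-8")[0])       # e with acute accent
three = utf8_sequence_length("\u20ac".encode("utf-8")[0])     # euro sign
four = utf8_sequence_length("\U0001F600".encode("utf-8")[0])  # emoji
```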
WINDOWS_1252_TO_UTF8: dict[int, bytes] = (source)

Undocumented

Value
{128: b'\xe2\x82\xac',
 130: b'\xe2\x80\x9a',
 131: b'\xc6\x92',
 132: b'\xe2\x80\x9e',
 133: b'\xe2\x80\xa6',
 134: b'\xe2\x80\xa0',
 135: b'\xe2\x80\xa1',
...
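
The entries shown above can be cross-checked against Python's own codecs: decoding each byte value as Windows-1252 and re-encoding it as UTF-8 should reproduce the listed bytes. This sketch covers only the seven entries visible on this page and makes no claim about the elided remainder of the table.

```python
# The visible portion of WINDOWS_1252_TO_UTF8, copied from this page.
shown_entries = {
    128: b"\xe2\x82\xac",
    130: b"\xe2\x80\x9a",
    131: b"\xc6\x92",
    132: b"\xe2\x80\x9e",
    133: b"\xe2\x80\xa6",
    134: b"\xe2\x80\xa0",
    135: b"\xe2\x80\xa1",
}

# Recompute each value with the standard library codecs.
recomputed = {
    byte_value: bytes([byte_value]).decode("windows-1252").encode("utf-8")
    for byte_value in shown_entries
}
```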
contains_replacement_characters: bool = (source)

Undocumented

detector = (source)

Undocumented

is_html = (source)

Undocumented

log = (source)

Undocumented

markup = (source)

Undocumented

original_encoding = (source)

Undocumented

smart_quotes_to = (source)

Undocumented

tried_encodings: list = (source)

Undocumented

unicode_markup = (source)

Undocumented

@property
declared_html_encoding = (source)

If the markup is an HTML document, returns the encoding declared _within_ the document.
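
To make "declared within the document" concrete, the sketch below pulls a `charset` value out of a `<meta>` tag with a regular expression. The pattern and the name `declared_html_encoding_sketch` are assumptions for illustration; the property's actual search logic is not shown on this page.

```python
import re

# Matches both <meta charset="..."> and the http-equiv/content form.
META_CHARSET = re.compile(
    rb"""<\s*meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def declared_html_encoding_sketch(markup: bytes):
    """Hypothetical helper: return the charset named in a meta tag, if any."""
    match = META_CHARSET.search(markup)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


short_form = declared_html_encoding_sketch(
    b'<html><head><meta charset="ISO-8859-1"></head></html>'
)
long_form = declared_html_encoding_sketch(
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
)
```

A declaration found this way is only a hint: the bytes may still fail to decode in the declared encoding, so it is best treated as one more candidate to try.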

def _codec(self, charset): (source)

Undocumented

def _convert_from(self, proposed, errors='strict'): (source)

Attempt to convert the markup to the proposed encoding.

:param proposed: The name of a character encoding.
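
A conversion attempt of this kind can be pictured as a guarded `decode` call that also records what was tried. The function below is a hypothetical sketch: the `tried` list loosely mirrors the `tried_encodings` instance variable, but the tuple layout and the `None` return on failure are assumptions.

```python
def convert_from_sketch(markup: bytes, proposed: str, tried: list,
                        errors: str = "strict"):
    """Hypothetical helper: try one encoding and log the attempt."""
    tried.append((proposed, errors))
    try:
        return markup.decode(proposed, errors)
    except (UnicodeDecodeError, LookupError):
        # Wrong encoding, or a codec name Python does not know.
        return None


tried = []
data = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8
strict_result = convert_from_sketch(data, "utf-8", tried)
lenient_result = convert_from_sketch(data, "utf-8", tried, errors="replace")
latin_result = convert_from_sketch(data, "latin-1", tried)
```

The strict attempt fails, the `errors="replace"` attempt succeeds but substitutes U+FFFD for the undecodable byte, and the Latin-1 attempt recovers the original text.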

def _sub_ms_char(self, match): (source)

Changes an MS smart quote character to an XML or HTML entity, or an ASCII character.
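
The three substitution modes can be sketched with the visible portions of `MS_CHARS` and `MS_CHARS_TO_ASCII`. The function `substitute_ms_chars` is hypothetical, it covers only bytes 0x80 to 0x86 as listed on this page, and it leaves any other byte untouched.

```python
import re

# Visible entries of MS_CHARS: (HTML entity name, hex code point) or a string.
MS_CHARS = {
    b"\x80": ("euro", "20AC"),
    b"\x81": " ",
    b"\x82": ("sbquo", "201A"),
    b"\x83": ("fnof", "192"),
    b"\x84": ("bdquo", "201E"),
    b"\x85": ("hellip", "2026"),
    b"\x86": ("dagger", "2020"),
}
# Visible entries of MS_CHARS_TO_ASCII.
MS_CHARS_TO_ASCII = {
    b"\x80": "EUR", b"\x81": " ", b"\x82": ",", b"\x83": "f",
    b"\x84": ",,", b"\x85": "...", b"\x86": "+",
}


def substitute_ms_chars(data: bytes, smart_quotes_to: str) -> bytes:
    """Hypothetical helper: replace Windows-1252 bytes per smart_quotes_to."""
    def replace(match):
        orig = match.group(0)
        if smart_quotes_to == "ascii":
            sub = MS_CHARS_TO_ASCII.get(orig)
            return orig if sub is None else sub.encode("ascii")
        sub = MS_CHARS.get(orig)
        if sub is None:
            return orig  # not in the excerpt shown on this page
        if isinstance(sub, tuple):
            name, hex_code = sub
            if smart_quotes_to == "xml":
                return b"&#x" + hex_code.encode("ascii") + b";"
            return b"&" + name.encode("ascii") + b";"
        return sub.encode("ascii")

    return re.sub(rb"[\x80-\x9f]", replace, data)


sample = b"5\x80 \x85"
as_html = substitute_ms_chars(sample, "html")
as_xml = substitute_ms_chars(sample, "xml")
as_ascii = substitute_ms_chars(sample, "ascii")
```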

def _to_unicode(self, data, encoding, errors='strict'): (source)

Given a string and its encoding, decodes the string into Unicode.

:param encoding: The name of an encoding.