bs4.dammit.UnicodeDammit

class documentation

class UnicodeDammit:

A class for detecting the encoding of an HTML or XML document and converting it to a Unicode string. If the source encoding is windows-1252, it can also replace Microsoft smart quotes with their HTML or XML equivalents.

Class Method	`detwingle`	Fix characters from one encoding embedded in some other encoding.
Method	`__init__`	Constructor.
Method	`find_codec`	Convert the name of a character set to a codec name.
Constant	`CHARSET_ALIASES`	Undocumented
Constant	`ENCODINGS_WITH_SMART_QUOTES`	Undocumented
Constant	`FIRST_MULTIBYTE_MARKER`	Undocumented
Constant	`LAST_MULTIBYTE_MARKER`	Undocumented
Constant	`MS_CHARS`	Undocumented
Constant	`MS_CHARS_TO_ASCII`	Undocumented
Constant	`MULTIBYTE_MARKERS_AND_SIZES`	Undocumented
Constant	`WINDOWS_1252_TO_UTF8`	Undocumented
Instance Variable	`contains_replacement_characters`	Undocumented
Instance Variable	`detector`	Undocumented
Instance Variable	`is_html`	Undocumented
Instance Variable	`log`	Undocumented
Instance Variable	`markup`	Undocumented
Instance Variable	`original_encoding`	Undocumented
Instance Variable	`smart_quotes_to`	Undocumented
Instance Variable	`tried_encodings`	Undocumented
Instance Variable	`unicode_markup`	Undocumented
Property	`declared_html_encoding`	If the markup is an HTML document, returns the encoding declared _within_ the document.
Method	`_codec`	Undocumented
Method	`_convert_from`	Attempt to convert the markup to the proposed encoding.
Method	`_sub_ms_char`	Changes a MS smart quote character to an XML or HTML entity, or an ASCII character.
Method	`_to_unicode`	Given a string and its encoding, decodes the string into Unicode.

@classmethod
def detwingle(cls, in_bytes, main_encoding='utf8', embedded_encoding='windows-1252'):

Fix characters from one encoding embedded in some other encoding.

Currently the only situation supported is Windows-1252 (or its subset ISO-8859-1), embedded in UTF-8.

:param in_bytes: A bytestring that you suspect contains characters from multiple encodings. Note that this _must_ be a bytestring. If you've already converted the document to Unicode, you're too late.
:param main_encoding: The primary encoding of `in_bytes`.
:param embedded_encoding: The encoding that was used to embed characters in the main document.
:return: A bytestring in which `embedded_encoding` characters have been converted to their `main_encoding` equivalents.
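As a usage sketch, the following builds a bytestring that mixes UTF-8 text with Windows-1252 smart quotes and runs it through `detwingle` before decoding. It assumes the `bs4` package is installed; the sample text and variable names are made up for illustration.

```python
from bs4.dammit import UnicodeDammit

# Three snowmen encoded as UTF-8, followed by a quotation whose curly
# quotes are encoded as Windows-1252 (bytes 0x93 and 0x94).
snowmen = ("\N{SNOWMAN}" * 3).encode("utf8")
quote = (
    "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
    "\N{RIGHT DOUBLE QUOTATION MARK}"
).encode("windows-1252")
mixed = snowmen + quote

# Decoding the mixed bytes directly as UTF-8 fails on the 0x93 byte.
try:
    mixed.decode("utf8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# detwingle rewrites the embedded Windows-1252 bytes as UTF-8 sequences,
# so the result can be decoded as plain UTF-8.
fixed = UnicodeDammit.detwingle(mixed)
text = fixed.decode("utf8")
print(text)
```

Note that `detwingle` is called on the raw bytes, before any decoding or parsing, as the `in_bytes` description requires.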

def __init__(self, markup, known_definite_encodings=[], smart_quotes_to=None, is_html=False, exclude_encodings=[], user_encodings=None, override_encodings=None):

Constructor.

:param markup: A bytestring representing markup in an unknown encoding.
:param known_definite_encodings: When determining the encoding of `markup`, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined here: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
:param user_encodings: These encodings will be tried after the `known_definite_encodings` have been tried and failed, and after an attempt to sniff the encoding by looking at a byte order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
:param override_encodings: A deprecated alias for `known_definite_encodings`. Any encodings here will be tried immediately after the encodings in `known_definite_encodings`.
:param smart_quotes_to: By default, Microsoft smart quotes will, like all other characters, be converted to Unicode characters. Setting this to 'ascii' will convert them to ASCII quotes instead. Setting it to 'xml' will convert them to XML entity references, and setting it to 'html' will convert them to HTML entity references.
:param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.
:param exclude_encodings: These encodings will not be considered, even if the sniffing code thinks they might make sense.
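A short usage sketch of the constructor follows, assuming the `bs4` package is installed. It passes candidate encodings as `known_definite_encodings` and shows the three non-default `smart_quotes_to` settings; the sample markup is illustrative only.

```python
from bs4.dammit import UnicodeDammit

# Bytes in Latin-1: 0xE9 is "é" there, but is not valid UTF-8 on its own.
dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)     # the decoded str
print(dammit.original_encoding)  # the encoding that succeeded

# Windows-1252 smart quotes (bytes 0x93, 0x94, 0x92) converted three ways.
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
as_html = UnicodeDammit(
    markup, ["windows-1252"], smart_quotes_to="html"
).unicode_markup
as_xml = UnicodeDammit(
    markup, ["windows-1252"], smart_quotes_to="xml"
).unicode_markup
as_ascii = UnicodeDammit(
    markup, ["windows-1252"], smart_quotes_to="ascii"
).unicode_markup

for result in (as_html, as_xml, as_ascii):
    print(result)
```

The decoded text is read from the `unicode_markup` instance variable, and the encoding that worked from `original_encoding`.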

def find_codec(self, charset):

Convert the name of a character set to a codec name. :param charset: The name of a character set. :return: The name of a codec.
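To illustrate the general idea of mapping a character set name to a codec name, here is a simplified standard-library sketch. The helper `resolve_codec` is hypothetical and is not bs4's implementation; it only assumes an alias table shaped like `CHARSET_ALIASES` and uses `codecs.lookup` to check whether Python knows a name.

```python
import codecs

# Alias table copied from the CHARSET_ALIASES value documented on this page.
CHARSET_ALIASES = {"macintosh": "mac-roman", "x-sjis": "shift-jis"}


def resolve_codec(charset):
    """Return a lowercase name Python can decode with, or None.

    Hypothetical helper for illustration; it is not part of bs4.
    """
    if not charset:
        return None
    # Try the aliased name first, then the name with hyphens removed
    # or replaced by underscores.
    candidates = [
        CHARSET_ALIASES.get(charset, charset),
        charset.replace("-", ""),
        charset.replace("-", "_"),
    ]
    for name in candidates:
        try:
            codecs.lookup(name)
        except LookupError:
            continue
        return name.lower()
    return None


for charset in ("macintosh", "x-sjis", "UTF-8", "no-such-charset"):
    print(charset, "->", resolve_codec(charset))
```

Returning `None` for an unknown name is a choice made for this sketch; the library's behavior for unrecognized names may differ.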

CHARSET_ALIASES: dict[str, str] =

Undocumented

Value

{'macintosh': 'mac-roman', 'x-sjis': 'shift-jis'}

ENCODINGS_WITH_SMART_QUOTES: list[str] =

Undocumented

Value

['windows-1252', 'iso-8859-1', 'iso-8859-2']

FIRST_MULTIBYTE_MARKER =

Undocumented

Value

MULTIBYTE_MARKERS_AND_SIZES[0][0]

LAST_MULTIBYTE_MARKER =

Undocumented

Value

MULTIBYTE_MARKERS_AND_SIZES[-1][1]

MS_CHARS: dict =

Undocumented

Value

{b'\x80': ('euro', '20AC'),
 b'\x81': ' ',
 b'\x82': ('sbquo', '201A'),
 b'\x83': ('fnof', '192'),
 b'\x84': ('bdquo', '201E'),
 b'\x85': ('hellip', '2026'),
 b'\x86': ('dagger', '2020'),
...

MS_CHARS_TO_ASCII: dict =

Undocumented

Value

{b'\x80': 'EUR',
 b'\x81': ' ',
 b'\x82': ',',
 b'\x83': 'f',
 b'\x84': ',,',
 b'\x85': '...',
 b'\x86': '+',
...
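The two tables above pair naturally with the `smart_quotes_to` settings 'html', 'xml' and 'ascii'. The sketch below shows one plausible reading: each `MS_CHARS` tuple holds an HTML entity name and a hexadecimal code point, while `MS_CHARS_TO_ASCII` holds a plain ASCII stand-in. The helper `substitute` is hypothetical, and only a few tuple-valued entries are copied here.

```python
# Excerpts of the two tables documented above, keyed by Windows-1252 byte.
MS_CHARS = {
    b"\x80": ("euro", "20AC"),
    b"\x82": ("sbquo", "201A"),
    b"\x85": ("hellip", "2026"),
}
MS_CHARS_TO_ASCII = {b"\x80": "EUR", b"\x82": ",", b"\x85": "..."}


def substitute(byte, mode):
    """Hypothetical helper: render one Windows-1252 byte in the given mode."""
    if mode == "ascii":
        return MS_CHARS_TO_ASCII[byte]
    name, hex_code_point = MS_CHARS[byte]
    if mode == "xml":
        return "&#x" + hex_code_point + ";"  # numeric character reference
    return "&" + name + ";"  # named HTML entity


for mode in ("html", "xml", "ascii"):
    print(mode, substitute(b"\x85", mode))

# The hex string in each tuple is the character's Unicode code point.
assert chr(int("2026", 16)) == b"\x85".decode("windows-1252")
```

Entries whose value is a plain string rather than a tuple, such as the one for `b'\x81'`, are left out of this excerpt and would need separate handling.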

MULTIBYTE_MARKERS_AND_SIZES: list =

Undocumented

Value

[(194, 223, 2), (224, 239, 3), (240, 244, 4)]
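The values are consistent with reading each tuple as (first lead byte, last lead byte, sequence length) for UTF-8 multi-byte sequences. The following sketch checks that reading against Python's own UTF-8 encoder; `sequence_length` is a hypothetical helper, not a bs4 function.

```python
# Each tuple read as (first lead byte, last lead byte, sequence length).
MULTIBYTE_MARKERS_AND_SIZES = [(194, 223, 2), (224, 239, 3), (240, 244, 4)]


def sequence_length(lead_byte):
    """Hypothetical helper: length of the UTF-8 sequence a lead byte starts.

    Returns None when the byte is not a multi-byte lead byte.
    """
    for first, last, size in MULTIBYTE_MARKERS_AND_SIZES:
        if first <= lead_byte <= last:
            return size
    return None


# LATIN SMALL LETTER E WITH ACUTE, SNOWMAN, GRINNING FACE
for ch in ("\u00e9", "\u2603", "\U0001F600"):
    encoded = ch.encode("utf8")
    # The table's prediction matches the real encoded length.
    assert sequence_length(encoded[0]) == len(encoded)

print(sequence_length(ord("a")))  # None: ASCII is not a multi-byte lead byte
print(sequence_length(0x93))      # None: 0x93 is a continuation byte in UTF-8
```

Under this reading, `FIRST_MULTIBYTE_MARKER` is 194 and `LAST_MULTIBYTE_MARKER` is 244, the bounds of the whole lead-byte range.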

WINDOWS_1252_TO_UTF8: dict[int, bytes] =

Undocumented

Value

{128: b'\xe2\x82\xac',
 130: b'\xe2\x80\x9a',
 131: b'\xc6\x92',
 132: b'\xe2\x80\x9e',
 133: b'\xe2\x80\xa6',
 134: b'\xe2\x80\xa0',
 135: b'\xe2\x80\xa1',
...
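Each listed entry maps a Windows-1252 byte value to the UTF-8 encoding of the same character. As a sanity check, the sketch below recomputes the listed entries with Python's standard codecs; `reencode` is a hypothetical helper, and only the entries shown above are included.

```python
# The first entries of WINDOWS_1252_TO_UTF8 as listed above.
documented = {
    128: b"\xe2\x82\xac",
    130: b"\xe2\x80\x9a",
    131: b"\xc6\x92",
    132: b"\xe2\x80\x9e",
    133: b"\xe2\x80\xa6",
    134: b"\xe2\x80\xa0",
    135: b"\xe2\x80\xa1",
}


def reencode(byte_value):
    """Hypothetical helper: Windows-1252 byte value -> UTF-8 bytes."""
    return bytes([byte_value]).decode("windows-1252").encode("utf8")


# Collect any entry that disagrees with Python's codecs (none are expected).
mismatches = {k: v for k, v in documented.items() if reencode(k) != v}
print(mismatches)
```

Byte value 129 does not appear in the listing; Python's `windows-1252` codec also treats that byte as undefined, so `reencode(129)` would raise `UnicodeDecodeError`.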