decodeh - heuristically decode a string or text file

The evoque.decodeh module (part of the Evoque Templating distribution) conveniently combines a number of ideas and techniques for guessing a string's encoding and returning a Unicode object. The technique depends on a codec failing to decode -- but just because a codec succeeds in decoding a string with a specific encoding does not mean that that encoding is the right one. The same bytes representing non-ASCII characters are valid in several encodings and may thus be decoded to the wrong characters without raising any UnicodeError or RoundTripError.
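
As a minimal illustration (not part of the module), the same byte decodes cleanly under several 8-bit codecs, each yielding different text:

b = b"caf\xe9"                 # bytes intended as latin-1 "café"
print(b.decode("latin_1"))     # café -- the intended text
print(b.decode("koi8_u"))      # cafИ -- also decodes without error, but is wrong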

The decodeh module supports two mechanisms to help increase the chances that the guess is the correct one in the given situation. The first is control over which encodings to try, and in what order. The user may specify a list of encodings to try, in most-likely order; a default such list is provided in the ENCS module variable. More likely encodings, such as locale defaults, are however always tried before the encodings in the supplied list. In addition, a single encoding may be explicitly specified -- in which case it is always tried first.
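
For example (a sketch only; "legacy.txt" is a hypothetical file), the caller can force a first guess and narrow or reorder the candidate encodings:

from evoque.decodeh import decode, ENCS

b = open("legacy.txt", 'rb').read()    # hypothetical input file

# Try "cp1252" first, then fall back to a caller-supplied shortlist.
s = decode(b, enc="cp1252", encodings=["iso8859_15", "latin_1"])

# Or simply rely on the default ordering provided by ENCS.
s = decode(b, encodings=ENCS)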

The second is support for adding any number of encoding-specific checks, performed after a first guess at an encoding (and thus with a minimal performance hit), to determine whether a better-fitting encoding might exist among those still to be tried in the encodings list. If there is, processing jumps forward to that position in the encodings list; if decoding from there turns out to be unsatisfactory, the previous guess is used instead. The execution of the sequence of checks on a guessed encoding is handled by the extensible may_do_better mechanism. Checks are user-specifiable via a dictionary that defines the sequence of check functions to call per encoding; by default, the provided MDB module variable is used. Each check function performs a small, well-defined check and may be called whenever that specific check is needed. All check functions have the same signature, (b, candidenc), must define the two list attributes scopencs and candidencs, and return either None (no likely better candidate) or a candidate for a more appropriate encoding further along the list.
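
As a sketch (the check itself is hypothetical and deliberately simplistic), a user-defined check might look as follows and be registered in a copy of MDB:

import re
from evoque.decodeh import decode, MDB

# Hypothetical check: while "latin_1" is the current guess, bytes such as
# 0xA4/0xA6/0xA8 hint that iso8859_15 may be a better fit.
def _maybe_iso8859_15(b, candidenc):
    if re.search(b"[\xa4\xa6\xa8]", b) is not None:
        return candidenc
_maybe_iso8859_15.scopencs = ["latin_1"]       # guessed encodings this check applies to
_maybe_iso8859_15.candidencs = ["iso8859_15"]  # encoding(s) it may propose instead

my_mdb = dict(MDB)
my_mdb["latin_1"] = list(MDB["latin_1"]) + [_maybe_iso8859_15]

s = decode(b"price: 100 \xa4", mdb=my_mdb)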

The heart of the decodeh module is the decode_heuristically() function, which returns the 3-tuple: (unicode object, encoding used, whether any chars had to be deleted from the input to produce the Unicode object).
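
For instance (a sketch only; "unknown.txt" is a hypothetical file), the tuple can be unpacked to inspect which encoding was chosen and whether data was dropped:

from evoque.decodeh import decode_heuristically

raw = open("unknown.txt", 'rb').read()    # hypothetical input file
s, used_enc, lossy = decode_heuristically(raw)
if lossy:
    print("guessed %s, but some bytes had to be dropped" % used_enc)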

Two other convenient utilities, decode() and decode_from_file(), are provided for the frequent case where all that is needed is a unicode object. If any data had to be discarded to produce the unicode object, these two functions will by default raise a RoundTripError, a sub-type of UnicodeError. This default behaviour may be changed by passing the keyword parameter lossy=True to either of these two utilities.
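
For example (again a sketch, with a hypothetical file name), a strict decode can be retried in lossy mode if a clean round trip is not possible:

from evoque.decodeh import decode_from_file, RoundTripError

try:
    s = decode_from_file("unknown.txt")              # raises on any data loss
except RoundTripError:
    s = decode_from_file("unknown.txt", lossy=True)  # accept a lossy decode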

The Python source code for decodeh (sans module doc string) is below:

__revision__ = "$Id$"

import sys, codecs, locale, re

if sys.version < '3':
    bytes = str
else:
    unicode = str

class RoundTripError(UnicodeError):
    pass

# for clarity between py2 and py3, the variable name "b" refers to a
# py2 or py3 bytestring, and "s" refers to a py2 unicode or a py3 str.

# for py3, the values returned by codecs.BOM_* are bytes objects
UTF_BOMS = [
    (getattr(codecs, 'BOM_UTF8', '\xef\xbb\xbf'), 'utf_8'),
    (getattr(codecs, 'BOM_UTF16_LE', '\xff\xfe'), 'utf_16_le'), # utf-16
    (getattr(codecs, 'BOM_UTF16_BE', '\xfe\xff'), 'utf_16_be'),
    #(getattr(codecs, 'BOM_UTF32_LE', '\xff\xfe\x00\x00'), 'utf_32_le'), # utf-32
    #(getattr(codecs, 'BOM_UTF32_BE', '\x00\x00\xfe\xff'), 'utf_32_be')
]

def get_bom_encoding(b):
    """ (b:bytes) -> either((None, None), (bom:bytes, encoding:str))
    """
    for bom, encoding in UTF_BOMS:
        if b.startswith(bom):
            return bom, encoding
    return None, None

def is_lossy(b, enc, s=None):
    """ (b:bytes, enc:str, s:either(str, None)) -> bool
    Return False if a decode/encode roundtrip of byte string b does not
    lose any data. If s is not None, it is expected to be the unicode
    string given by b.decode(enc).
    Note that this will, incorrectly, return True for cases where the
    encoding is ambiguous, e.g. is_lossy("\x1b(BHallo","iso2022_jp"), see
    comp.lang.python thread "unicode(s, enc).encode(enc) == s ?".
    """
    if s is None:
        s = b.decode(enc)
    if s.encode(enc) == b:
        return False
    else:
        return True

# may_do_better post-guess checks

def may_do_better(b, encodings, guenc, mdb):
    """ Processes the mdb conf object and returns None or a best candidate
    """
    funcs = mdb.get(guenc)
    if funcs is None:
        return None
    # candidencs not in encodings or appearing before guenc are ignored
    guenc_index = encodings.index(guenc)
    for func in funcs:
        for candidenc in func.candidencs:
            if not candidenc in encodings:
                continue
            if guenc_index > encodings.index(candidenc):
                continue
            cenc = func(b, candidenc)
            if cenc is not None:
                return cenc

def compiled_re(pattern):
    """ (raw_pattern:either(bytes, str)) -> the compiled RE for pattern
    For py3, we cannot match a str pattern on a bytes object:
        TypeError: can't use a string pattern on a bytes-like object
    Whether pattern is a py2/py3 bytes or unicode pattern, the resulting
    compiled re always corresponds.
    """
    return re.compile(bytes(pattern.encode()))

# ASCII chars in range below never appear in text
def _ascii_non_text(b, candidenc):
    if _ascii_non_text.re.search(b) is not None:
        return candidenc
_ascii_non_text.scopencs = ["ascii"]
_ascii_non_text.candidencs = ["BINARY"]
_ascii_non_text.re = compiled_re(r"[\x00-\x06\x0b\x0e-\x1f\x7f]")

# ISO 2022 ESC sequences are more likely to be used in ISO 2022 encodings
def _iso2022_jp_escapes(b, candidenc):
    if _iso2022_jp_escapes.re.search(b) is not None:
        return candidenc
_iso2022_jp_escapes.scopencs = ["ascii"]
_iso2022_jp_escapes.candidencs = ["iso2022_jp"]
_iso2022_jp_escapes.re = compiled_re(r"\x1b\(B|\x1b\(J|\x1b\$@|\x1b\$B")

# the latin_1 control chars 0x80 to 0x9F (but not 0x85) are displayable in
# non-ISO extended ASCII (Mac, IBM PC), most likely candidate being cp1252
def _latin_1_control_chars(b, candidenc):
    if _latin_1_control_chars.re.search(b) is not None:
        return candidenc
_latin_1_control_chars.scopencs = ["latin_1"]
_latin_1_control_chars.candidencs = ["cp1252"]
_latin_1_control_chars.re = compiled_re(r"[\x80-\x84\x86-\x9f]")

# Chars in range below are more likely to be used as symbols in iso8859_15
def _iso8859_15_symbols(b, candidenc):
    if _iso8859_15_symbols.re.search(b) is not None:
        return candidenc
_iso8859_15_symbols.scopencs = ["latin_1", "cp1252"]
_iso8859_15_symbols.candidencs = ["iso8859_15"]
_iso8859_15_symbols.re = compiled_re(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]")

# user specifiable parameters - defaults

# The default list of encodings to try (after "ascii" and "utf_8").
# Order matters! Encoding names use the corresponding python codec name,
# as listed at: http://docs.python.org/lib/standard-encodings.html
ENCS = [
    "latin_1",      # add other iso-8859 encodings
    "cp1252",       # add other Windows/Mac encodings
    "iso8859_15",
    "mbcs",
    "big5",
    "euc_jp",
    "euc_kr",
    "gb2312",
    "gbk",
    "gb18030",
    "hz",
    "iso2022_jp",
    "iso2022_jp_1",
    "iso2022_jp_2",
    "iso2022_jp_3",
    "iso2022_jp_2004",
    "iso2022_jp_ext",
    "iso2022_kr",
    "koi8_u",
    "ptcp154",
    "shift_jis"
]

# Encodings to ignore
IGNORE_ENCS = [None, "cp0"]

# Dictionary specifying the may_do_better checks per encoding. Whenever any
# of the following functions returns a non-None candidenc value, the algorithm
# will skip forward to the value's position in the encodings list.
# For any function here to be executed, its target candidenc must be in the
# list of encodings passed to _decode_heuristically().
MDB = {
    "ascii": [
        # _ascii_non_text, # likely to be binary
        _iso2022_jp_escapes,
    ],
    "latin_1": [
        _latin_1_control_chars,
        _iso8859_15_symbols,
    ],
    # may refine with further tests to discern which ISO Latin encoding
    "cp1252": [
        _iso8859_15_symbols,
    ],
    # may refine with further tests to discern which Windows/Mac encoding
}

# user callable utilities

def decode_from_file(filename, enc=None, encodings=ENCS, mdb=MDB, lossy=False):
    """ (filename:str, enc:str, encodings:list, mdb:dict, lossy:bool) -> str
    Convenient wrapper on decode(str) for reading a text file.
    """
    # We open the file in binary mode, and let the algorithm do the guesswork
    b = open(filename, 'rb').read()
    return decode(b, enc=enc, encodings=encodings, mdb=mdb, lossy=lossy)

def decode(bs, enc=None, encodings=ENCS, mdb=MDB, lossy=False):
    """ (bs:either(bytes, str), enc:str, encodings:list, mdb:dict, lossy:bool) -> str
    Raises RoundTripError when lossy=False and re-encoding the string is not
    equal to the input string.
    """
    s, enc, loses = decode_heuristically(bs, enc=enc, encodings=encodings, mdb=mdb)
    if not lossy and loses:
        raise RoundTripError("Data loss in decode/encode round trip")
    else:
        return s

def decode_heuristically(bs, enc=None, encodings=ENCS, mdb=MDB):
    """ (bs:either(bytes, str), enc:str, encodings:list, mdb:dict) ->
            (x:unicode, enc:str, lossy:bool)
    Tries to determine the best encoding to use from a list of specified
    encodings, and returns the 3-tuple: a unicode object, the encoding used,
    and whether deleting chars from input was needed to generate a Unicode
    object. The list of all encodings to be considered is prepared once and
    is then passed on for actual processing (recursive).
    """
    if isinstance(bs, unicode):
        # nothing to do
        return bs, "utf_8", False
    # At this point, bs is therefore necessarily a byte string.
    # A priori, the byte string may be in a UTF encoding and may have a BOM
    # that we may use but that we must also remove.
    bom, bom_enc = get_bom_encoding(bs)
    if bom is not None:
        bs = bs[len(bom):]
    # Order is important: encodings should be in a *most likely* order.
    # Thus, we always try first:
    # a) any caller-provided encoding
    # b) encoding from UTF BOM
    # c) ascii, common case and is unambiguous if no errors
    # d) utf_8
    # e) system default encoding
    # f) any encodings we can glean from the locale
    precedencs = [enc, bom_enc, "ascii", "utf_8", sys.getdefaultencoding()]
    try:
        precedencs.append(locale.getpreferredencoding())
    except AttributeError:
        pass
    try:
        precedencs.append(locale.nl_langinfo(locale.CODESET))
    except AttributeError:
        pass
    try:
        precedencs.append(locale.getlocale()[1])
    except (AttributeError, IndexError):
        pass
    try:
        precedencs.append(locale.getdefaultlocale()[1])
    except (AttributeError, IndexError, ValueError):
        pass
    # Build list of encodings to process, normalizing on lowercase names
    # and avoiding any None and duplicate values.
    precedencs = [ e.lower() for e in precedencs if e not in IGNORE_ENCS ]
    allencs = []
    for e in precedencs:
        if e not in allencs:
            allencs.append(e)
    allencs += [ e for e in encodings if e not in allencs ]
    # Check integrity of the mdb dict
    for guenc, funcs in mdb.items():
        for func in funcs:
            assert guenc in func.scopencs
    return _decode_heuristically(bs, allencs, mdb)

def _decode_heuristically(b, allencs, mdb):
    """ Recursive function to loop over and examine each encoding in allencs.
    """
    eliminencs = []
    for enc in allencs:
        try:
            # for py3, s may only be a bytes object
            s = b.decode(enc)
        except (UnicodeError, LookupError):
            eliminencs.append(enc)
            continue
        else:
            candidenc = may_do_better(b, allencs, enc, mdb)
            if candidenc is not None:
                # recurse to process from candidenc's position in allencs
                y, yenc, loses = _decode_heuristically(b,
                        allencs[allencs.index(candidenc):], mdb)
                if not loses:
                    return y, yenc, False
            return s, enc, False
    # no enc worked - try again, using "ignore" parameter, return longest
    if eliminencs:
        allencs = [ e for e in allencs if e not in eliminencs ]
    output = [ (b.decode(enc, "ignore"), enc) for enc in allencs ]
    output = [ (len(s[0]), s) for s in output ]
    output.sort()
    s, enc = output[-1][1]
    if not is_lossy(b, enc, s):
        return s, enc, False
    else:
        return s, enc, True

The decodeh module was initially inspired by a similarly named module by Skip Montanaro; a number of other articles and resources have also been inspirational and helpful.

Your comments are welcome.