Intro

Today, I want to discuss a topic that, when I discovered it, significantly improved my workflow. It is the standard library module unicodedata. This module proves exceptionally valuable when parsing data. Let’s delve into why.

Normalization

The unicodedata.normalize(form, unistr, /) function converts a Unicode string into one of its normal forms.

There are four possible values for form, two for composing and two for decomposing (compared side by side right after this list):

  • NFC: Normalization Form Composed. In this form, characters are composed into their canonical equivalents.
  • NFKC: Normalization Form Compatibility Composed. Similar to NFC, but it also applies compatibility equivalents.
  • NFD: Normalization Form Decomposed. In this form, characters are decomposed into their canonical equivalents.
  • NFKD: Normalization Form Compatibility Decomposed. Similar to NFD, but it also applies compatibility transformations.
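
To see the difference at a glance, here’s a quick preview (the sample word ﬁancé is my own choice; it contains the fi ligature U+FB01 and the precomposed é U+00E9):

>>> import unicodedata
>>> sample = '\uFB01anc\u00E9'
>>> print(sample)
ﬁancé
>>> [(form, len(unicodedata.normalize(form, sample))) for form in ('NFC', 'NFD', 'NFKC', 'NFKD')]
[('NFC', 5), ('NFD', 6), ('NFKC', 6), ('NFKD', 7)]

The compatibility forms split the ligature into separate f and i characters, and the decomposed forms split é into a base letter plus a combining accent; we’ll go through each case below.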

First, let’s check a few things in a Python shell:

>>> len('a')  # as str
1
>>> len('a'.encode())  # as utf8 bytes
1
>>> len('á')  # as str
1
>>> len('á'.encode()) # as utf8 bytes
2

The two bytes for á are not decomposition, though. In UTF-8, the characters U+0000 through U+007F (a.k.a. ASCII) are stored as single bytes; they are the only characters whose code points numerically match their UTF-8 representation. For example, U+0041 becomes 0x41, which is 01000001 in binary.

All other characters are represented with multiple bytes. U+0080 through U+07FF use two bytes each, U+0800 through U+FFFF use three bytes each, and U+10000 through U+10FFFF use four bytes each.

However, to the Python interpreter a str is a sequence of code points, so characters like á (and all others) always count as one when len() is applied to them. At that level of abstraction, str doesn’t care about the encoding.
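
A quick illustration of those byte widths (the sample characters are just one pick from each range):

>>> for ch in ('A', 'é', '€', '😀'):
...     print(f"U+{ord(ch):04X}: len as str = {len(ch)}, len as UTF-8 bytes = {len(ch.encode())}")
... 
U+0041: len as str = 1, len as UTF-8 bytes = 1
U+00E9: len as str = 1, len as UTF-8 bytes = 2
U+20AC: len as str = 1, len as UTF-8 bytes = 3
U+1F600: len as str = 1, len as UTF-8 bytes = 4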

What’s a canonical equivalent? This is very common in languages other than English. For example, when you type ' followed by e on many keyboard layouts, you get é. That single character is canonically equivalent to a sequence of two characters: a base letter plus a combining accent. The same happens with characters like á, í, ò, ç, etc.

Now, let’s put unicodedata.normalize() into action:

>>> import unicodedata
>>> char = 'á'
>>> len(char)
1
>>> print('Unicode codepoint:', f"U+{ord(char):04X}")
Unicode codepoint: U+00E1
>>> '\u00E1'
'á'
>>> decomposed = unicodedata.normalize('NFD', char)
>>> len(decomposed)
2
>>> print('Unicode codepoints:', ' '.join([f"U+{ord(c):04X}" for c in decomposed]))
Unicode codepoints: U+0061 U+0301
>>> '\u0061\u0301' == decomposed
True
>>> print(decomposed)
a

As you can see, when we decompose the character into its canonical form, we get two characters as the result (the second one seems invisible). Note that decomposed is still a str; we didn’t perform any UTF-8 encoding.

If you check, U+0061 is the regular a character, while U+0301 is the combining acute accent. Essentially, we are decomposing the character into the keyboard combination we typed.
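
We can confirm this with unicodedata.name(), which returns the official Unicode name of a code point:

>>> [unicodedata.name(c) for c in decomposed]
['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']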

We can put them together through composition:

>>> composed = unicodedata.normalize('NFC', decomposed)
>>> len(composed)
1
>>> print('Unicode codepoint:', f"U+{ord(composed):04X}")
Unicode codepoint: U+00E1
>>> print(composed)
á
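
This is exactly why normalization matters when comparing strings: two strings can render identically and still compare as unequal, unless they are normalized to the same form first.

>>> composed == char
True
>>> decomposed == char
False
>>> unicodedata.normalize('NFC', decomposed) == unicodedata.normalize('NFC', char)
True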

What about the compatibility forms (the letter K)? While NFD is only concerned with canonical decomposition, NFKD extends it by also handling compatibility decomposition. Characters like U+FB03 (the ffi ligature) and U+338F (the ㎏ square symbol) are a vivid example. Roughly speaking, compatibility means the character is an alternative presentation of one or more other characters:

>>> ffi = '\uFB03'
>>> kg = '\u338F'
>>> print(ffi, kg)
ffi ㎏
>>> print(len(ffi), len(kg))
1 1
>>> ffi_nfd_decomposed = unicodedata.normalize('NFD', ffi)
>>> kg_nfd_decomposed = unicodedata.normalize('NFD', kg)
>>> kg_nfkd_decomposed = unicodedata.normalize('NFKD', kg)
>>> ffi_nfkd_decomposed = unicodedata.normalize('NFKD', ffi)
>>> print(f"{ffi_nfd_decomposed=}, {kg_nfd_decomposed=}, {kg_nfkd_decomposed=}, {ffi_nfkd_decomposed=}")
ffi_nfd_decomposed='ffi', kg_nfd_decomposed='㎏', kg_nfkd_decomposed='kg', ffi_nfkd_decomposed='ffi'
>>> (ffi_nfd_decomposed, kg_nfd_decomposed) == (ffi, kg)
True
>>> (ffi_nfkd_decomposed, kg_nfkd_decomposed) == (ffi, kg)
False

As demonstrated, NFD leaves these characters untouched, while NFKD successfully decomposes them into their component parts.
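
For these two characters NFKC gives the same result as NFKD, because after the compatibility decomposition there is no canonical composed form to recombine them into:

>>> unicodedata.normalize('NFKC', ffi), unicodedata.normalize('NFKC', kg)
('ffi', 'kg')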

Categorization

Every Unicode character is assigned to a general category, such as Ll for lowercase letters, Lu for uppercase letters, Zs for space separators, Nd for decimal digits, Po for other punctuation, and so on. You can find the full category list in the Unicode Character Database documentation.

To check the category of a character, we use the unicodedata.category() function (it accepts exactly one character):

>>> unicodedata.category('D')
'Lu'
>>> unicodedata.category('d')
'Ll'
>>> unicodedata.category(' ')
'Zs'
>>> unicodedata.category('6')
'Nd'
>>> unicodedata.category('.')
'Po'
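
Since category() works on a single character, you can map it over a whole string to classify every character (the sample string here is just for illustration):

>>> [(c, unicodedata.category(c)) for c in 'Año 10.']
[('A', 'Lu'), ('ñ', 'Ll'), ('o', 'Ll'), (' ', 'Zs'), ('1', 'Nd'), ('0', 'Nd'), ('.', 'Po')]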

Parsing

Now, onto the interesting part: we can combine both functions to normalize text while parsing. I’ve used this technique in cases where I receive text from a source that isn’t standardized. Picture a scenario where fields like “Country” or “City” accept free-text input; despite the lack of standardization, I still need to match that input to an Enum in my code.

Another valuable use case arises when dealing with text input from users. It’s common to encounter strange characters or emojis, which can cause issues elsewhere in the system. This is particularly problematic if the input pertains to critical fields such as names, addresses, or aliases.

>>> raw_strings = ["Bogotá", "Bogotá D.C.", "Name with emoji \u2665 \u23f3", "Text\x00 with null and \x01 return \x00\n"]
>>> for string in raw_strings:
...     print(string)
... 
Bogotá
Bogotá D.C.
Name with emoji ♥ ⏳
Text with null and  return 

>>> def normalize(string):
...     # keep only letters and spaces; the combining accents produced by NFKD
...     # (category Mn) are not in the list, so they are dropped along with
...     # digits, punctuation, symbols and control characters
...     allowed = ['Ll', 'Lu', 'Zs']
...     return ''.join(
...         char for char in unicodedata.normalize('NFKD', string)
...         if unicodedata.category(char) in allowed
...     )
...
>>> for string in raw_strings:
...     print(normalize(string))
... 
Bogota
Bogota DC
Name with emoji  
Text with null and  return 
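
As a rough sketch of the Enum-matching use case mentioned earlier (the City enum and the casefold()/strip() calls are my own additions for illustration):

>>> from enum import Enum
>>> class City(Enum):
...     BOGOTA = 'bogota'
...     MEDELLIN = 'medellin'
... 
>>> def to_city(raw):
...     # normalize() is the helper defined above; lowercase and trim before the lookup
...     return City(normalize(raw).casefold().strip())
... 
>>> to_city('Bogotá') is City.BOGOTA
True
>>> to_city(' MEDELLÍN ') is City.MEDELLIN
True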