Unicode And Encoding

Deep dive · part of Python Strings

Python str is a sequence of Unicode code points. Bytes on disk or wire need an encoding, almost always UTF-8. Use str.encode/bytes.decode, never assume the locale default.

Python 3 str holds Unicode code points; bytes hold raw octets on wire or disk. Encoding maps str to bytes; decoding maps bytes to str. UTF-8 is the default interchange format; never assume locale encodings on Windows servers.

Mojibake (cafÃ©) appears when UTF-8 bytes are decoded as Latin-1—fix by re-encoding with the wrong codec and decoding with the right one. Normalization (NFC/NFD) matters when comparing user input or building search indexes.

Production code combines this topic with logging, tests, and clear module boundaries so refactors stay safe when requirements grow.

str.encode('utf-8') and bytes.decode('utf-8') are the primary pair.

errors='replace' or 'strict' control handling of invalid sequences.

unicodedata.normalize forms composed characters consistently.

Code points above BMP appear as single str elements in Python 3.

open(path, encoding='utf-8') avoids platform default surprises.

surrogateescape helps recover damaged filenames on Unix.

JSON and HTTP must declare charset or use UTF-8. When reading legacy CSV, try utf-8 first, fall back with logging. For security, reject unexpected encodings rather than guessing silently in web apps.

Collation and case folding are locale-specific—lower() is not enough for all languages; use locale modules or ICU for sorting human names.

Try utf-8-sig for Excel CSV before chardet—BOM detection prevents silent misreads on Windows exports.

utf-8-sig handles Excel CSV BOM; log first 40 bytes hex when decode fails in ETL.

Read the parent tutorial on pythondeck.com for runnable snippets, then reproduce them locally in a virtual environment with pinned dependency versions matching your deployment target.

When pairing with teammates, agree on one idiomatic pattern per concern—mixed styles in one repo slow reviews and invite subtle integration bugs during merges.

Using default encoding in open() on Windows (cp1252) breaking Linux-generated files.

Comparing normalized vs non-normalized strings for equality.

Mixing str and bytes in concatenation without explicit conversion.

Assuming len(str) counts graphemes users perceive (use grapheme libraries if needed).

Standardize UTF-8 for all project files and APIs.

Normalize to NFC before storing user display names.

Decode at system boundaries once; work in str internally.

Log encoding used when parsing fails, include hex snippet of bytes.

Re-read the examples below with these ideas in mind; change variable names and inputs to match your own project.

The program below demonstrates encode / decode. Read the comments on each line, run the code, then change names or values to see how the output shifts.

# Example: Encode / decode
# Run in the REPL or save as a .py file and execute with python.
s = "café \N{SNOWMAN}"
b = s.encode("utf-8")
print(b, len(b))
print(b.decode("utf-8"))

This sample walks through normalise in a small, runnable script. Paste it into the REPL or save it as a .py file before you continue to the next block.

# Example: Normalise
# Run in the REPL or save as a .py file and execute with python.
import unicodedata
a = "é"                      # one code point
b = "e\u0301"                # two code points
print(a == b)                              # False
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))

Here is a hands-on illustration of detect / fix mojibake. Follow the inline comments first; only then execute the snippet and compare the result with what you expected.

# Example: Detect / fix mojibake
# Run in the REPL or save as a .py file and execute with python.
raw = b"caf\xc3\xa9"        # utf-8 bytes
bad = raw.decode("latin-1")  # wrong decoder -> 'cafÃ©'
fix = bad.encode("latin-1").decode("utf-8")
print(fix)

The program below demonstrates utf-8 roundtrip. Read the comments on each line, run the code, then change names or values to see how the output shifts.

# str is Unicode; encode to bytes with an explicit codec
text = "café ☃"  # non-ASCII characters
data = text.encode("utf-8")  # bytes on wire/disk
print(data, len(data))  # b'...' longer than char count
back = data.decode("utf-8")  # bytes -> str
print(back == text)  # True
for ch in text:  # code points
    print(ch, hex(ord(ch)))  # character + code point hex

This sample walks through normalize forms in a small, runnable script. Paste it into the REPL or save it as a .py file before you continue to the next block.

# Same glyph may be multiple code point sequences — normalize before compare
import unicodedata  # normalization
a = "é"  # single code point e-acute
b = "e\u0301"  # e + combining acute
print(a == b)  # False raw
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True NFC
print(unicodedata.name(a))  # character name
print(len(a), len(b))  # code point counts differ before normalize

Related deep dives on Python Strings:

String Performance Tips

« back to Python Strings All tutorials