I am doing some text processing. The text contains Polish letters (e.g. ę, ą, ł, ź, ó, …) and the files are
utf-8 encoded. Python 2.7 was giving me errors like:
UnicodeDecodeError: 'ascii' codec can't decode byte .
I found a lot of advice out there, but only after combining the
four tips below was I able to get rid of all the issues, display Polish letters correctly in the terminal, and write them to files.
1. Define file encoding
A magic comment goes at the top (first or second line) of the .py file to declare the encoding of the source file.
For more details, see PEP 0263 – Defining Python Source Code Encodings.
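As a minimal sketch, a file using the declaration form described in PEP 263 could look like the following. The sample string and the variable name are only illustrative.

```python
# -*- coding: utf-8 -*-
# The line above must be the first or second line of the file.
# With it, Python 2.7 accepts non-ASCII characters in this source file.

greeting = u'Zażółć gęślą jaźń'  # illustrative sample text
print(len(greeting))  # counts characters, not bytes
```

Without the declaration, Python 2.7 refuses to compile a source file that contains non-ASCII bytes.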
2. Unicode literals
Normally I would need to prefix each string literal with u, as in
u'ę'. Importing unicode_literals from __future__ just makes my life easier.
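A short sketch of the effect, assuming the standard `from __future__ import unicode_literals` import. The names `word` and `is_text` are made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# With the import above, a plain literal is already a unicode string
# in Python 2.7, so no u'' prefix is needed.
word = 'żółw'

# type(u'') is the unicode type on Python 2.7 (and str on Python 3).
is_text = isinstance(word, type(u''))
print(is_text, len(word))
```

The import has to come before any other statement in the module, right after the encoding comment and any docstring.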
3. Read and write file with specified encoding
For files that are in
utf-8, I had to specify the encoding explicitly both when reading and when writing.
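One way to do this is sketched below, under the assumption that `io.open` is used (`codecs.open` is another option in Python 2.7). The temporary file path and the sample text are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u'gęś, łódź, źrebię\n'  # illustrative sample text
path = os.path.join(tempfile.mkdtemp(), 'polish.txt')  # hypothetical file

# Writing: io.open encodes the unicode string to utf-8 bytes.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading: io.open decodes the bytes back into a unicode string.
with io.open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()

print(read_back == text)
```

The built-in `open` in Python 2.7 returns raw bytes instead, which is where the implicit ASCII decoding errors tend to come from.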
4. str vs unicode
I had to replace some occurrences of str with
unicode. The latter produces a unicode string version of the object.
More details here: Python doc # unicode
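A small sketch of the difference follows. The `text_type` alias is an assumption added only so that the snippet also runs on Python 3, where the `unicode` built-in no longer exists. The other names are illustrative.

```python
# -*- coding: utf-8 -*-
try:
    text_type = unicode  # Python 2.7: the built-in unicode type
except NameError:
    text_type = str      # Python 3: str is already a unicode string

name = u'Łukasz'  # illustrative value
count = 3

# On Python 2.7, str(name) would raise UnicodeEncodeError, because str()
# tries to encode the unicode string with the ASCII codec.
label = text_type(count) + u' x ' + text_type(name)
print(label)
```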