Python and Polish letters
I am doing some text processing. The text contains Polish letters (e.g. ę, ą, ł, ź, ó, …) and the files are UTF-8 encoded. Python 2.7 was giving me errors like: UnicodeDecodeError: 'ascii' codec can't decode byte ...
I found plenty of advice out there, but only after combining the four tips below was I able to get rid of all the issues, display Polish letters correctly in the terminal, and write them to files.
1. Define file encoding
# -*- coding: utf-8 -*-
This magic comment goes at the top (first or second line) of a .py file to declare the file's encoding.
For more details, see PEP 0263 – Defining Python Source Code Encodings.
2. Unicode literals
from __future__ import unicode_literals
Normally I would need to prefix each string literal with u, e.g. u'ę'. Including this import just makes my life easier.
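A minimal sketch of what a unicode literal gives you compared with its encoded bytes. It uses an explicit u prefix, which is what the import applies to every plain literal in a Python 2.7 file. The word and the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-

# Explicit u prefix; with unicode_literals a plain 'żółć' behaves the same in Python 2.7.
word = u'żółć'

# A unicode string is measured in characters...
char_count = len(word)

# ...while its UTF-8 encoding is measured in bytes (each of these letters takes two).
utf8_bytes = word.encode('utf-8')
byte_count = len(utf8_bytes)

print('%d characters, %d bytes' % (char_count, byte_count))
```

Slicing, len() and comparisons on the unicode value work per letter, which is usually what text processing needs.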
3. Read and write file with specified encoding
import codecs
...
def some_function_reading_text_file(in_file):
    with codecs.open(in_file, 'r', 'utf-8') as f:
        return f.read()  # decoded from UTF-8 into a unicode string
For files that are in UTF-8, I had to specify the encoding explicitly, both when reading and when writing.
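Since the snippet above only covers reading, here is a rough sketch of a full write-and-read round trip with codecs.open. The helper names write_text and read_text, the sample sentence, and the temporary file are assumptions made for the example, not part of my original script.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import shutil
import tempfile


def write_text(path, text):
    """Encode `text` as UTF-8 and write it to `path` (hypothetical helper)."""
    with codecs.open(path, 'w', 'utf-8') as f:
        f.write(text)


def read_text(path):
    """Read `path` and decode its contents from UTF-8 (hypothetical helper)."""
    with codecs.open(path, 'r', 'utf-8') as f:
        return f.read()


sample = u'zażółć gęślą jaźń'

# Work in a throwaway directory so the example leaves nothing behind.
tmp_dir = tempfile.mkdtemp()
try:
    path = os.path.join(tmp_dir, 'polish.txt')
    write_text(path, sample)
    round_trip = read_text(path)
finally:
    shutil.rmtree(tmp_dir)
```

If everything is wired correctly, round_trip compares equal to sample, Polish letters included.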
4. str vs unicode
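I had to replace some occurrences of str with unicode. In Python 2.7, str(x) returns a byte string, and for a unicode value it encodes with the ASCII codec, which fails on Polish letters. unicode(x) produces a unicode string version of the object instead.
More details here: Python doc, unicode()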
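One way to keep such conversions in a single place is a small helper like the one below. This is only a sketch: to_text and text_type are hypothetical names, and the fallback branch exists just so the same file also runs on Python 3, where unicode is gone and str is already a Unicode string.

```python
# -*- coding: utf-8 -*-

try:
    text_type = unicode  # Python 2: the unicode built-in
except NameError:
    text_type = str      # Python 3: str is already a Unicode string


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return `value` as a Unicode string."""
    if isinstance(value, bytes):
        # Decode byte strings explicitly instead of relying on the ASCII default.
        return value.decode(encoding)
    return text_type(value)


# UTF-8 bytes for the word "łódź", as they might arrive from a file or socket.
label = to_text(b'\xc5\x82\xc3\xb3d\xc5\xba')
number = to_text(42)
```

Calling to_text wherever str(...) used to be keeps everything as unicode until the moment it is written out.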
