Python and Polish letters

I am doing some text processing. The text contains Polish letters (e.g. ę, ą, ł, ź, ó, …) and the files are UTF-8 encoded. Python 2.7 was giving me errors like: UnicodeDecodeError: 'ascii' codec can't decode byte …

I found plenty of advice out there, but only after combining the four tips below was I able to get rid of all the issues, display Polish letters correctly in the terminal, and write them to files.

1. Define file encoding

# -*- coding: utf-8 -*-

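This magic comment goes at the top (first or second line) of the .py file and declares the encoding of the source file. Without it, Python 2 assumes ASCII and rejects non-ASCII characters in the source.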

For more details, see PEP 0263 – Defining Python Source Code Encodings.

2. Unicode literals

from __future__ import unicode_literals

Normally I would need to prefix each string literal with u, e.g. u'ę'. Including this import just makes my life easier.
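Here is a minimal sketch of the effect, assuming Python 2.7 (on Python 3 the import is accepted but changes nothing, since literals are already text). The variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# With the import, a plain literal is already a unicode string,
# so no u prefix is needed.
word = 'żółć'

# len() counts characters, not UTF-8 bytes.
char_count = len(word)

# Encoding explicitly yields the UTF-8 bytes (two per letter here).
byte_count = len(word.encode('utf-8'))

print(char_count)
print(byte_count)
```

Without the import, the same literal in Python 2.7 would be a byte string, and len(word) would report the number of bytes instead.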

3. Read and write file with specified encoding

import codecs
...
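def some_function_reading_text_file(in_file):
    # Decode the file as UTF-8 while reading, so f.read() gives unicode text.
    with codecs.open(in_file, 'r', 'utf-8') as f:
        # The original body is not shown; returning the whole text is one option.
        return f.read()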

Because the files are UTF-8 encoded, I had to specify the encoding explicitly both when reading and when writing them.
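A small round trip may make this clearer. It is only a sketch: the sample sentence and the temporary file path are assumptions for illustration, not part of my actual script.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

text = u'zażółć gęślą jaźń\n'

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'polish.txt')

# Write: codecs.open encodes the unicode text to UTF-8 bytes.
with codecs.open(path, 'w', 'utf-8') as f:
    f.write(text)

# Read: the same encoding argument decodes the bytes back to unicode text.
with codecs.open(path, 'r', 'utf-8') as f:
    round_tripped = f.read()

print(round_tripped == text)
```

Using the same encoding on both sides is what keeps the Polish letters intact.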

4. str vs unicode

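I had to replace some occurrences of str with unicode. In Python 2, str() tries to encode a unicode value with the ASCII codec and fails on Polish letters, whereas unicode() produces a unicode string version of the object.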

More details here: Python doc # unicode
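The replacement can be sketched roughly like this. The helper to_text and the text_type alias are hypothetical names, and the try/except shim is my own assumption, added only so the snippet also runs on Python 3, where unicode no longer exists.

```python
# -*- coding: utf-8 -*-
try:
    text_type = unicode  # Python 2: the unicode built-in
except NameError:
    text_type = str      # Python 3: str is already unicode text


def to_text(value):
    """Return a unicode string version of value (hypothetical helper)."""
    return text_type(value)


# On Python 2, str(u'łódź') would raise UnicodeEncodeError here,
# because str() implicitly encodes with the ASCII codec.
label = to_text(u'łódź') + u' ' + to_text(42)

print(label == u'łódź 42')
```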
