Python and Polish letters
I am doing some text processing. The text contains Polish letters (e.g. ę, ą, ł, ź, ó, …) and the files are UTF-8 encoded. Python 2.7 was giving me errors like: UnicodeDecodeError: 'ascii' codec can't decode byte.
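To see where this error comes from, here is a minimal sketch that triggers it on purpose. Python 2.7 performs the same ascii decode implicitly whenever a byte string containing non-ASCII bytes is mixed with a unicode string. The variable names are mine and purely illustrative.

```python
# -*- coding: utf-8 -*-
# 'ę' encoded as UTF-8 is a two-byte sequence, and neither byte is valid ASCII.
raw = u'ę'.encode('utf-8')

try:
    # Python 2.7 does this decode behind the scenes when str meets unicode.
    raw.decode('ascii')
    failed = False
    message = ''
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

print(failed)
print(message)
```

Decoding the same bytes with 'utf-8' instead of 'ascii' succeeds, which is what the tips below arrange for in one way or another.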
I found a lot of advice out there, but only after combining the four tips below was I able to get rid of all the issues, display Polish letters correctly in the terminal, and write them to files.
1. Define file encoding
# -*- coding: utf-8 -*-
This magic comment goes at the top (first or second line) of the .py file to define the file's encoding.
For more details see: PEP 0263 – Defining Python Source Code Encodings
2. Unicode literals
from __future__ import unicode_literals
Normally I would need to prefix each string literal with u, e.g. u'ę'. Including this import just makes my life easier.
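A rough sketch of what the import changes, assuming the source file is saved as UTF-8. Without it, a plain 'żółw' literal in Python 2.7 is a byte string, equivalent to word_bytes below. With it, the same literal behaves like word_unicode. Both variable names are just for illustration.

```python
# -*- coding: utf-8 -*-
# What a plain 'żółw' literal becomes once unicode_literals is imported.
word_unicode = u'żółw'

# What a plain 'żółw' literal is in Python 2.7 without the import: UTF-8 bytes.
word_bytes = word_unicode.encode('utf-8')

char_count = len(word_unicode)   # counts characters
byte_count = len(word_bytes)     # counts bytes; ż, ó and ł take two bytes each

print(char_count)
print(byte_count)
```

The length mismatch is the kind of thing that silently breaks slicing, padding and comparisons when byte strings are used for Polish text.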
3. Read and write file with specified encoding
import codecs
...
def some_function_reading_text_file(in_file):
    with codecs.open(in_file, 'r', 'utf-8') as f:
        return f.read()  # content is decoded from utf-8 into unicode
The files I work with are in UTF-8, so I had to specify the encoding both when reading and when writing them.
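Writing works the same way as reading. Below is a small round-trip sketch. The helper names write_text and read_text and the temporary file are my own choices for the example, not something from the original script.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def write_text(path, text):
    # Hypothetical helper: encodes unicode text to UTF-8 on the way out.
    with codecs.open(path, 'w', 'utf-8') as f:
        f.write(text)


def read_text(path):
    # Hypothetical helper: decodes UTF-8 bytes to unicode on the way in.
    with codecs.open(path, 'r', 'utf-8') as f:
        return f.read()


sample = u'zażółć gęślą jaźń'

# Use a throwaway file so the example does not touch real data.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    write_text(path, sample)
    round_tripped = read_text(path)
finally:
    os.remove(path)

print(round_tripped == sample)
```

If the text that comes back equals the text that went in, the Polish letters survived the trip to disk.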
4. str vs unicode
I had to replace some occurrences of str with unicode. The latter produces a unicode string version of the object.
More details here: Python doc # unicode
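A short sketch of the kind of replacement I mean. In Python 2.7, str(u'Łódź') raises UnicodeEncodeError because it tries to encode to ascii, while unicode(u'Łódź') works. The text_type alias and the describe function are illustrative only; the fallback to str exists just so the snippet also runs on Python 3, where unicode no longer exists.

```python
# -*- coding: utf-8 -*-
try:
    text_type = unicode   # Python 2.7 built-in
except NameError:
    text_type = str       # Python 3 fallback, an assumption for portability


def describe(city, count):
    # text_type(...) instead of str(...): safe for Polish letters in Python 2.7.
    return text_type(city) + u': ' + text_type(count)


line = describe(u'Łódź', 3)
print(line)
```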