Hi all,
I want to start a discussion, triggered by [1], on how to properly deal
with strings and unicode, especially in Python 2. The goal should be to
come up with a "best practices" section in the Python GTK+ 3 Tutorial.
My current understanding is, and please correct me if I'm wrong, that
str is a sequence of bytes and unicode is a sequence of code points
(stored as 16- or 32-bit integers, depending on how Python 2 was
built). Therefore, a unicode object is an abstract representation of a
string, independent of any encoding. As GTK+ only supports UTF-8
encoded strings, you have to encode every unicode object to UTF-8
before supplying it to GTK+. In most cases a unicode object is
automatically converted to UTF-8:
from gi.repository import Gtk
label = Gtk.Label()
label.set_text(u"l\xf6\xe6man")
However, "label.get_text()" returns a str (byte representation) that
looks like 'l\xc3\xb6\xc3\xa6man' in Python 2, but a str (unicode
representation) in Python 3. This is a pain if you want to retrieve a
string from a widget and concatenate it with a unicode string, such as:
u"F\xfc\xdfe " + label.get_text()
which will give you the infamous UnicodeDecodeError in Python 2, because
the byte string is implicitly decoded with the ASCII codec.
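To make the failure concrete without needing GTK+, here is a minimal
sketch. It assumes that a plain byte string can stand in for what
label.get_text() returns under Python 2, and it spells out the implicit
ASCII decoding that Python 2 performs during concatenation. The names
raw and implicit_coerce are made up for illustration.

```python
# Stand-in for the value label.get_text() returns under Python 2
# (assumption: the UTF-8 encoding of the text set on the label).
raw = u"l\xf6\xe6man".encode("utf-8")


def implicit_coerce(data):
    """Mimic Python 2's implicit str -> unicode coercion (ASCII codec)."""
    return data.decode("ascii")


try:
    implicit_coerce(raw)
    failed = False
except UnicodeDecodeError:
    # The byte 0xc3 is outside the ASCII range, so decoding fails.
    failed = True
```

Decoding explicitly with "ascii" is only there so that the sketch
behaves the same under Python 2 and Python 3.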
In Python 3 things work fine: you provide a unicode representation and
you get a unicode representation back. In Python 2 it is a mess. A
working solution is
u"F\xfc\xdfe " + label.get_text().decode("utf-8")
or
u"F\xfc\xdfe ".encode("utf-8") + label.get_text()
I personally would prefer to work with unicode representations all the
time instead of the byte representation, but I don't know how much we
can change this behavior if we want to preserve API/ABI.
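As a stopgap for working with unicode throughout, the decode step could
be wrapped in a small helper. This is only a sketch: to_unicode is a
hypothetical name, not part of PyGObject, and the byte string below
again stands in for a value returned by a widget under Python 2.

```python
def to_unicode(value, encoding="utf-8"):
    """Return value as unicode text, decoding byte strings if needed.

    Hypothetical helper: under Python 2, bytes is an alias for str, so
    this decodes what the widget getters return; under Python 3 the
    text passes through unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Byte string standing in for a widget's get_text() result in Python 2.
widget_text = b"l\xc3\xb6\xc3\xa6man"
combined = u"F\xfc\xdfe " + to_unicode(widget_text)
```

With this, the same concatenation should work unchanged under both
Python versions.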
What do you think?
[1]: https://bugzilla.gnome.org/show_bug.cgi?id=663610
--
Best regards,
Sebastian Pölsterl