Bug #101792...



At the GTK developer IRC meeting yesterday we talked about the API
changes suggested in bug #101792. We reached a conclusion, but after
sleeping on it, I have some doubts... As I think this is an important
issue that could cause no end of compatibility headaches in the
future, I would like to see some wider discussion.

First some background: In general, non-ASCII file names work
relatively well with GLib and GTK+ on Windows, at least file names
that are in the user's language or a related language.

What is broken is that GLib cannot handle file names that aren't
expressible in the system codepage of the machine. For instance, on my
English Windows 2000, g_dir_read_name() can't return file names with
Chinese characters. But on a Chinese machine, it can. Ditto for
Cyrillic/Hebrew/Greek/Arabic/Japanese/Korean.
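To make "not expressible in the system codepage" concrete, here is a
minimal sketch. It assumes ISO-8859-1 as a stand-in for a Western
system codepage (the real Windows codepage 1252 differs in the
0x80-0x9F range), and fits_in_latin1() is a hypothetical helper, not
GLib API. It only checks whether a UTF-8 name could be converted
without loss.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if every character of the UTF-8
 * string `s` has a code point <= U+00FF, i.e. fits in ISO-8859-1.
 * Anything above that, and any malformed sequence, returns false. */
static bool fits_in_latin1(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != 0) {
        if (*p < 0x80) {
            /* Plain ASCII, one byte. */
            p += 1;
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF are encoded as C2 80 .. C3 BF. */
            p += 2;
        } else {
            /* Code point above U+00FF, or invalid UTF-8. */
            return false;
        }
    }
    return true;
}
```

A name with accented Latin letters passes this check, while a name
with Chinese characters does not, which is the case that currently
cannot be returned on an English machine.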

In my latest patch suggestion to bug #101792 (which is/was just a
suggestion and basis for discussion; it doesn't give me any
particularly warm and fuzzy feelings) I basically suggest that GLib
handle three kinds of file name encodings:

- UTF-8. Not really used for file names in GLib as such, but when
  talking to GTK. GLib would just have API to convert to/from this and
  the two others.

- The on-disk byte encoding. On Windows this is the file's Unicode
  name converted to the system codepage when possible. On Unix it is
  whatever bytes actually are in the file name (might be one
  consistent encoding, or a mess of legacy charsets, depending on how
  confused the users of the machine in question are... Modern Unix
  users hopefully use UTF-8.)

  Note that one can have file names on Windows machines that aren't
  representable in the system codepage, if they contain (Unicode)
  characters that aren't present in that codepage. This issue is what
  bug #101792 is about. Windows uses Unicode internally.
  "Sophisticated" native applications use the wide-character API and
  have no problem with such filenames.

- "gfilename", which on Unix would be the same as the previous one,
  the on-disk encoding. On Windows it would be UTF-8, and correspond
  one-to-one with the actual on-disk encoding used in (modern)
  versions of Windows, i.e. Unicode in the form of UTF-16.

My patch suggests new API that would take/return such "gfilenames". I
also suggest wrappers, taking "gfilenames", for the relatively few C
library functions that take file names. On Unix the wrappers wouldn't
do any conversion, just call the C library function in question. On
Windows they would convert to wide characters and call the
wide-character version of the C library function.
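A rough sketch of what such a wrapper could look like, for fopen()
only. The name sketch_fopen() and the helper utf8_to_wide() are
hypothetical and not from the patch. I assume here that the Windows
branch converts with MultiByteToWideChar() and calls _wfopen(); the
actual patch may do the conversion differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <wchar.h>
#include <windows.h>

/* Hypothetical helper: convert a UTF-8 string to a newly allocated
 * UTF-16 string. Returns NULL if the input is invalid or on OOM. */
static wchar_t *utf8_to_wide(const char *utf8)
{
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8, -1, NULL, 0);
    wchar_t *w;

    if (n <= 0)
        return NULL;
    w = malloc((size_t)n * sizeof *w);
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical wrapper: takes a "gfilename". On Unix that is the
 * on-disk bytes, passed straight through. On Windows it is UTF-8,
 * converted to wide characters for the wide-character C library call. */
FILE *sketch_fopen(const char *filename, const char *mode)
{
#ifdef _WIN32
    wchar_t *wname = utf8_to_wide(filename);
    wchar_t *wmode = utf8_to_wide(mode);
    FILE *fp = NULL;
    int saved_errno = EINVAL;   /* reported if a conversion failed */

    if (wname != NULL && wmode != NULL) {
        fp = _wfopen(wname, wmode);
        saved_errno = errno;
    }
    free(wname);
    free(wmode);
    errno = saved_errno;
    return fp;
#else
    return fopen(filename, mode);
#endif
}
```

An application would then call sketch_fopen(name, "r") wherever it
used to call fopen(name, "r") on a name it got from GLib or GTK.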

Owen didn't like having three kinds of file names. I understand
him. He proposed that GLib should start pretending that the Windows
on-disk encoding is UTF-8, so for instance g_dir_read_name() would
return UTF-8 names, g_filename_to/from_utf8() would reduce to
g_strdup(), and the GLib API that takes file names would require them
to be in UTF-8. (All this on Windows; on Unix nothing would change.)

The GLib functions that handle file names would use the wide-character
API on Windows, and thus be able to handle all file names.

I said OK.

But, now I have partly changed my mind. At least, we shouldn't break
binary compatibility. We can't change the ABI of existing entry points
in the GLib DLL. If somebody upgrades GLib on his machine to 2.6,
GLib-using applications will break horribly if g_dir_read_name()
starts returning UTF-8 names (and the application then proceeds to try
to open such file names). I don't want to see the mess this would
cause. GLib has been very good at keeping DLL versions binary
compatible so far.

I think we must keep the ABI unchanged for the existing entry points
in the GLib DLL.

What we could well do is require minor source-level changes as apps
are compiled and/or linked with GLib 2.6. When handling file names
that originate from GTK or GLib, applications should change their
open/fopen/stat/rename/etc calls to g_open/g_fopen/g_stat/g_rename/etc
instead, if they want to continue to work with non-ASCII file names on
Windows.

I suggest we use some preprocessor hacks to achieve this. For example,
the entry point g_dir_read_name in the GLib DLL would continue to
return file names in the system codepage. But, the headers for GLib
2.6 would cause an application calling g_dir_read_name() to actually
call something called g_dir_read_name_utf8() on Windows. 
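The kind of preprocessor hack I mean could look roughly like the
following sketch. The demo_ names are hypothetical stand-ins so that
the fragment compiles on its own; in GLib the pair would be
g_dir_read_name and g_dir_read_name_utf8, and the #define would live
in the public header while the library itself is built without it.

```c
#include <assert.h>
#include <string.h>

/* Old entry point: stays in the DLL with its old behaviour, so
 * already-built applications keep getting system codepage names. */
const char *demo_dir_read_name(void)
{
    return "name in system codepage";
}

/* New entry point: returns UTF-8 names. */
const char *demo_dir_read_name_utf8(void)
{
    return "name in UTF-8";
}

/* What the new header would add, on Windows only: newly compiled
 * callers of demo_dir_read_name() silently bind to the _utf8 entry
 * point. On Unix nothing is redirected. */
#ifdef _WIN32
#define demo_dir_read_name demo_dir_read_name_utf8
#endif
```

Source code stays unchanged, but a recompiled Windows application
ends up importing the _utf8 symbol, while old binaries keep importing
the old one.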

One could even arrange to have open/fopen/stat/etc entries in the GLib
import library (real object files, not DLL stubs) that would print
out some link-time warning and at run-time call the corresponding g_
wrapper function. (Is a link-time warning possible with ld? I am
pretty sure it is possible with Microsoft's linker.) Hmm.

Argh, this is a mess, my head is exploding. Comments, please.

--tml


