Re: [patch] wide characters support




a-higuti@math.sci.hokudai.ac.jp (Akira Higuchi) writes:

> Hello,
> 
> I've uploaded a patch to ftp.gtk.org.
> 
> gtk-a-higuti-981202-0.patch.gz
> 
> This patch adds support for wide characters to gdk, and adds
> support for Asian languages to GtkText.
> 
> I have some questions, and a suggestion.
> 
> * Will it be applied? If not, what's the problem?

I've been working for the last few days on integrating
the older version of your patch into GTK+, so I was
quite happy to see the new version.

But I was a bit puzzled. A lot of things that were
in the old patch don't seem to be in the new one. A few
examples (I can provide you with the full
incremental diff between the patches if you are
interested):

 - The "--default-fontset" flag to configure. I tend
   to agree that this wasn't a good thing to add.
   I'd rather see a mechanism like I proposed here
   of looking up a default gtkrc for the current
   locale.

 - Setting LC_CTYPE instead of LC_ALL. 

 - A change to gdk_window_new with colormap allocation.
   (It looks like this one might have been accidental
    in the first patch)

 - The removal of gtk_editable_finalize(), which
   destroyed editable->ic (possibly for a second
   time).

 - enter/leave handlers for the text and entry widgets,
   which reset the point location.

 - A change to the ordering of gtk_widget_unrealize(),
   to prevent windows from being destroyed before
   their input contexts.

Actually many of these changes are changes that I haven't applied
yet to my copy, because I didn't like them quite as much as the
rest of the patch, but I'm not sure if their omission from your
second patch was intentional or not.
 
> * What is the best way to determine the definition of the type
>   of wide characters?
> 
>   In my code, GdkWchar (the type of wide characters) is always
>   guint32, and size conversions are performed if wchar_t does not
>   equal guint32. The reason I had to use such an ugly approach
>   is that I couldn't find a good way of getting the size of
>   wchar_t that works even if X_LOCALE is used. Of course it's
>   easy to add a test to configure.in, but it seems that there are
>   no other tests which record system-dependent settings in header
>   files that will be installed. Or should the definition of
>   GdkWchar be in glibconfig.h?

We can install a system-dependent header file for GDK
in /usr/local/lib/gtk/include. (We have a bunch of other
GDK-specific XIM stuff in glibconfig.h, but I don't want
to continue that trend.)
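The size conversion Akira describes could be sketched roughly as follows. This is illustrative only, not the actual GDK API: the typedef name and helper are assumptions, and the point is simply that pinning GdkWchar to a 32-bit type makes conversion from the platform wchar_t an element-wise copy, regardless of whether wchar_t is 16 or 32 bits wide.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical sketch: GdkWchar is always a 32-bit type. */
typedef unsigned int GdkWchar;

/* Convert a native wchar_t buffer to GdkWchar. The element-wise
 * copy widens 16-bit wchar_t values and is a plain copy when
 * wchar_t is already 32 bits, so the caller never needs to know
 * sizeof (wchar_t). */
static void
wchar_to_gdkwchar (const wchar_t *src, GdkWchar *dest, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    dest[i] = (GdkWchar) src[i];
}
```

With this scheme only the installed header has to record the platform's wchar_t size; code using GdkWchar stays the same everywhere.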
 
> * Why Unicode?
> 
>   I heard that gtk+ is inclined toward Unicode, but I can't agree
>   on it. I understand that there are some advantages of settling
>   a particular coding (e.g., Unicode) for wide characters in gtk+,
>   but such a change will be inconvenient for Asian users.
> 
>   As you know, a Chinese character is an ideograph, i.e., each
>   character itself has meaning and nearly corresponds to a word
>   of phonograms. So the idea 'all the characters' does not make
>   sense. 

I don't understand this. Although in the past new ideographs
were created in China and Japan, this does not happen any more
as far as I know.

I don't think it is a bad assumption at all that all CJK
characters that will ever be used for anything but transcriptions
of a few rare old manuscripts are currently in Unicode. I believe
there is work being done to add some rare characters to the
part of the UCS-4 code space outside of the first 65536
characters.

Korean is a bit different in that Hangul are formed via rules, so
there are theoretically some (unpronounceable) syllables that
cannot be written in Unicode, except via combining forms. But
again, work is being done to add these unpronounceable syllables
as extensions to Unicode.

I think it can be pretty safely assumed that any interesting
future character set extensions will be done within the
framework of UCS-4. After all, 4 billion characters is
a lot of space!

>   In general, the best way to handle such a mutable set
>   of characters is to keep programs independent of a particular
>   encoding. The mb* and wc* functions defined in ISO C and XPG4
>   are ideal for this purpose, and the Xmb* and Xwc* functions in
>   Xlib are, too. Indeed Unicode is useful for implementing the
>   ISO C i18n functions (for instance, glibc2 uses UCS-4), but we
>   had better not make code depend on Unicode; we should use the
>   mb/wc functions instead. As long as we keep this policy, we can
>   take prompt measures to support a new character set in the
>   future. I guess this is why Unicode has a bad reputation in
>   Japan. Locale-dependent multibyte encodings are admittedly a
>   bit troublesome, but their difficulty is reasonable. If you
>   find multibyte strings too inconvenient, you have only to use
>   wide characters instead.
> 
>   Well, I think right-to-left languages can be supported using
>   the X Output Method. XOM can deal with context-dependent
>   drawing, too. I don't know whether an implementation exists,
>   but the best way to support such languages is to use these
>   functions in either case.

There are quite a few reasons for using Unicode:

 - Properties (such as the directionality) can be determined
   for an arbitrary Unicode character. They cannot for
   a character in an unknown character set.

 - The multibyte UTF-8 encoding has a number of nice properties
   that are not found in arbitrary encodings. For instance,
   it is possible to iterate backwards through a Unicode
   string starting in the middle, but it is not possible to
   do so using the ISO C functions for moving through strings.

 - If the multibyte encoding is known to be UTF-8, multibyte
   strings can be manipulated much more efficiently.

 - Unicode standardizes a number of issues (like bidirectional 
   text) that would otherwise have to be treated ad-hoc
   for each locale.

 - Unicode has good support for languages (for instance
   the languages of South Asia) where there is minimal
   existing support within X on a locale-specific basis.

 - The ISO C functions are tied to the current locale.
   With Unicode, there is at least a good hope that
   text will be displayed approximately correctly
   in any locale.

 - Unicode is becoming a very widespread standard:

   - It is used by Java, Tcl/Tk, Microsoft, etc.
   - It is a requirement for all future RFCs.
   - It is the internal encoding used in Mozilla, which looks
     like it will be the most important GTK+ application
     for the near future.
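The backward-iteration property mentioned above follows from UTF-8's self-synchronizing design: every continuation byte has the bit pattern 10xxxxxx, so from any byte position you can scan back to the previous character's lead byte. A minimal sketch (a hypothetical helper, not an existing GDK function):

```c
#include <stddef.h>

/* Step back from p to the start of the previous UTF-8 character
 * in the string beginning at start. Continuation bytes all match
 * 10xxxxxx (0x80-0xBF), so we skip over them until we reach a
 * lead byte (or the start of the string). */
static const char *
utf8_prev_char (const char *start, const char *p)
{
  do
    p--;
  while (p > start && ((unsigned char) *p & 0xC0) == 0x80);

  return p;
}
```

No equivalent exists for an arbitrary locale-dependent multibyte encoding, where a trailing byte is not distinguishable from a lead byte without scanning from the beginning of the string.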

Japan already has excellent support within the older framework of
locale-dependent encodings, so I can understand the reluctance to
move to Unicode; but I think Unicode is much more flexible when
it comes to moving beyond Roman+CJK.

Also, because it is a single well-defined standard, moving to
Unicode should convince developers to internationalize their
applications from the start, which should save having
to "Japanize" applications in a piecemeal fashion.
 
>   IMHO, we should
>   (1) do without ISO C mb/wc functions as far as support for
>       X_LOCALE is needed. In my code, gdk_wcstombs() and
>       gdk_mbstowcs() exist for this purpose.
>   (2) next, switch over from above functions to ISO C mb/wc
>       functions (and XPG4 functions if needed).
>   Or it may be OK to skip (1).

I don't see the ISO C functions as being at all dependable.
For older systems you have to fall back to X_LOCALE, which
has its own problems (binary incompatibility, for one).
Older systems won't support many scripts, and some systems
(such as glibc) may make rather idiosyncratic interpretations
of them.

This is why I'm going with your patch that moves to wide
character support for now. But in general, wide-character
support is much less attractive than multibyte for most
applications - it causes either a 2x or 4x space overhead
for ASCII text, and is not backwards compatible with
existing code that manipulates ASCII.
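Both points can be seen in a minimal UCS-4-to-UTF-8 encoder sketch (illustrative code, not part of either patch): ASCII code points encode to a single, identical byte, which is the backwards compatibility that a fixed-width wide-character array, spending sizeof (wchar_t) bytes on every character, gives up.

```c
/* Encode one UCS-4 code point as UTF-8 into out; returns the
 * number of bytes written (1-4). */
static int
ucs4_to_utf8 (unsigned int c, char *out)
{
  if (c < 0x80)
    {
      out[0] = (char) c;                        /* ASCII: unchanged */
      return 1;
    }
  else if (c < 0x800)
    {
      out[0] = (char) (0xC0 | (c >> 6));
      out[1] = (char) (0x80 | (c & 0x3F));
      return 2;
    }
  else if (c < 0x10000)
    {
      out[0] = (char) (0xE0 | (c >> 12));
      out[1] = (char) (0x80 | ((c >> 6) & 0x3F));
      out[2] = (char) (0x80 | (c & 0x3F));
      return 3;
    }
  else
    {
      out[0] = (char) (0xF0 | (c >> 18));
      out[1] = (char) (0x80 | ((c >> 12) & 0x3F));
      out[2] = (char) (0x80 | ((c >> 6) & 0x3F));
      out[3] = (char) (0x80 | (c & 0x3F));
      return 4;
    }
}
```

For mostly-ASCII text the multibyte form costs one byte per character; the wide-character form costs two or four for every character, ASCII included.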

Regards,
                                        Owen


