Re: New and oddity




"Sylvain 'Murdock' Glaize" <mokona@club-internet.fr> writes:

> Hello,
> 
> I'm rather new to this list and to Gtk+, and I'm mainly
> here to look at the development discussions while trying
> to catch the whole Gtk+ code meaning. I'm mainly
> considering the multi-charset problems in Gtk+.
> This mail goes here rather than to i18n list because
> it's more about details of implementation, and not
> long scope i18n considerations.

Well, actually, I'd consider the issues that you are talking about
definitely gtk-i18n issues, though your post isn't off
topic here.
 
> Anyway, I've used Gtk+ and found something odd in
> gdk_text_width_wc() (gdkfont.c). When I used it, it
> always returned 0 (zero). Looking at the code shows
> that if the font is a real 16bit, then the functions
> answers zero.
> 
> Is this by choice ? Or just a lacking feature ?
> (a call to XTextWidth16 here seems to work)

wc == wchar_t == the C library's wide characters

This is something different from a two-byte index into a font
(XChar2b), and treating the two interchangeably won't work.  For
instance, GNU libc always uses Unicode as its internal wide-character
encoding.

XTextWidth16() expects XChar2b. XwcTextExtents() takes an
XFontSet and knows (sometimes, sort of) how to convert
between wchar_t and the appropriate font-specific
encoding.
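
To make the distinction concrete, here is a minimal Xlib-level sketch
(the font, font set, and string arguments are placeholders, not
anything GDK gives you):

  #include <X11/Xlib.h>
  #include <wchar.h>

  /* XTextWidth16() works on raw two-byte font indices; each XChar2b
   * is a byte1/byte2 pair indexing directly into one particular font. */
  int
  width_from_char2b (XFontStruct *font_struct, XChar2b *string, int nchars)
  {
    return XTextWidth16 (font_struct, string, nchars);
  }

  /* XwcTextExtents() works on wchar_t and lets the XFontSet map each
   * wide character to whatever font and encoding the locale requires. */
  int
  width_from_wchars (XFontSet font_set, wchar_t *text, int nwchars)
  {
    XRectangle ink, logical;

    XwcTextExtents (font_set, text, nwchars, &ink, &logical);
    return logical.width;
  }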

The way that GDK handles the XChar2b case is that if
you call gdk_text_width() with a 2-byte font, then
it treats the 'text' argument as an 'XChar2b *' cast
to 'char *'.
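
Something like this, in other words (a sketch - 'font' is assumed to
be a GdkFont wrapping a 2-byte X font, and the length is given in
bytes, so twice the number of characters):

  #include <gdk/gdk.h>
  #include <X11/Xlib.h>

  /* For a 2-byte font, the XChar2b array is simply cast to gchar *;
   * text_length counts bytes, i.e. 2 * the number of characters. */
  gint
  width_of_char2b_string (GdkFont *font, XChar2b *chars, gint nchars)
  {
    return gdk_text_width (font, (gchar *) chars, nchars * sizeof (XChar2b));
  }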

It's sort of an ugly solution, and, in retrospect, we
probably should have had separate APIs parallel to
XTextWidth16, etc.

But, at this point, it isn't worth changing. As is discussed
below, trying to work within the traditional X locale
model really is pretty screwed, and in future versions
of GTK+ we will be using iso10646 (Unicode) as the 
internal character encoding.

> Again in the gdk_char_width_wc, the 16bit case
> is ignored. But a remark in gdk_char_width is
> worrying about the 16bit case (where , for me,
> it shouldn't worry, as gdk_char_width function
> name seems to show that only 8bit coding are
> considered here).

See the comment about 'char *' to 'XChar2b *' above -
that is why the remark is there: the cast trick doesn't
work for gdk_char_width(), which takes only a single char.
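
That is, since gdk_char_width() takes a single one-byte character,
there is no way to pass an XChar2b through it; to measure one
character of a 2-byte font you have to go through gdk_text_width()
with a two-byte buffer, roughly:

  #include <gdk/gdk.h>
  #include <X11/Xlib.h>

  /* gdk_char_width() can't express an XChar2b, so measure a single
   * character of a 2-byte font via gdk_text_width() instead. */
  gint
  char2b_width (GdkFont *font, XChar2b ch)
  {
    return gdk_text_width (font, (gchar *) &ch, sizeof (XChar2b));
  }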

> One other thing, which I haven't seen discussed in the
> archives (neither in the i18n list) is a problem with
> gtk_text. When selecting a word by double-click,
> gtk_text calls isalnum to check word boundaries.
> As expected, if I'm in a non-accentuated locale and
> I want to display accentuated charsets, or even far-east
> charsets, the isalnum() is lost, and the selection is
> half made (that is, it is blocked by accents).

You simply cannot display text in multiple scripts 
simultaneously with the traditional C/X locale
model. If you want accented characters to work and
you are otherwise using English, you can just use
LC_ALL=en_US all the time, but there is no way this is
going to work with iso8859-2 or East Asian text
at the same time.
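
For the single-locale case, that just means selecting the locale
before GTK+ sets up its font and input handling - a minimal sketch
(en_US is only an example; gtk_set_locale() picks up whatever LC_ALL
or LANG is set to in the environment):

  #include <gtk/gtk.h>

  int
  main (int argc, char *argv[])
  {
    /* Honour LC_ALL/LANG from the environment (e.g. LC_ALL=en_US);
     * call this before gtk_init() so the X locale machinery sees it. */
    gtk_set_locale ();
    gtk_init (&argc, &argv);

    /* ... build the UI ... */

    gtk_main ();
    return 0;
  }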
 
> As long as I stay in one locale, there's no problem,
> as soon as I want xtext to display different texts from
> different charset, isalnum has no more meaning. Well,
> that's one of the libc limitations considering i18n...
> 
> Is there something planned for that ? I would consider
> to stop on blocking characters (set to be defined,
> but things as space, quotes,...) for western languages.
> Defining blocking characters would work better than
> defining what is a alpha numeric character, as they
> are more commonly shared among western languages.

Akira Higuchi has a patch that makes things a bit better within a
particular locale as far as word and line breaking
go - it tries to guess which characters in the
character set are ideographic, using heuristics like:

 ideogram == !space && !alnum && !punct && !cntrl

and takes those into account in character and line breaking.
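
Spelled out as code, the heuristic amounts to roughly this (a sketch
using the C library's wide-character classification; the actual patch
may phrase it differently):

  #include <wctype.h>

  /* Rough guess at whether a wide character is an ideogram: not
   * whitespace, not alphanumeric, not punctuation, not a control. */
  static int
  looks_like_ideogram (wint_t wc)
  {
    return !iswspace (wc) && !iswalnum (wc)
        && !iswpunct (wc) && !iswcntrl (wc);
  }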

But the heuristics may not be portable, and the algorithms
aren't really linguistically correct - you really need
specialized algorithms that know about things like
which characters are prohibited from beginning or ending a
line.

So, what is the solution that we are going to take?
We are working on writing a general system for handling
the layout and rendering of international text - called
Pango (http://www.pango.org). This is able to load up
modules specific to a language which know the rules
for character, word, and line breaks, and will be able
to handle converting iso10646 text into glyphs in a
clean and consistent manner.
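
Very roughly, the idea is that for each character position the
language module for that script fills in a set of break attributes -
whether a line break is allowed there, whether a word starts or ends
there, and so on. The names below are purely illustrative, not the
final API:

  #include <glib.h>

  /* Purely illustrative, not the final Pango API: one set of break
   * attributes per character position in a string of iso10646
   * code points. */
  typedef struct {
    guint can_break_line : 1;  /* a line break is allowed before this char */
    guint is_word_start  : 1;
    guint is_word_end    : 1;
  } BreakAttrs;

  /* A per-language module would export something of this shape,
   * filling in attrs[0..n_chars-1] by the rules of that language. */
  typedef void (*BreakFunc) (const guint32 *text, gint n_chars,
                             BreakAttrs *attrs);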

Regards,
                                        Owen





