Re: strcasecmp/tolower/toupper breakage



Christopher James Lahey <clahey ximian com> writes:

> On 04 May 2001 00:06:21 -0400, Havoc Pennington wrote:
> > It's far worse than you think - strcoll() doesn't work on
> > UTF-8. What's needed is a UTF-8 strcoll() implementation.
> > 
> > We punted this out of glib 2, it's really hard to implement. :-(
> > 
> > The cheesy way is to setlocale() to current locale, convert the
> > strings to locale encoding, compare, restore locale. But it's not
> > thread safe and it's butt slow. So not really acceptable.
> 
> This sounds like a tough way to do this to me, but it may be the only
> way.  What if we just take the code for doing strcoll out of glibc and
> write utf8_strcoll?  It would just use all the locale specific
> information we can find in glibc.

I doubt this is even remotely feasible. The strcoll implementation
in GLIBC is "only" 500 lines of code, but it uses information
from the locale definition files that is not publicly accessible -
certainly not portably publicly accessible.
 
> I'm not sure how glibc is set up to be expansible with respect to things
> like strcoll, but we could take a look at doing it similarly.

It uses large amounts of locale data (look at /usr/share/i18n,
/usr/lib/locale); we don't want to duplicate this information.

What we could do for GNU libc 2.2 is use the experimental extended
locale model stuff, which provides a __strcoll_l which takes
an extra __locale_t argument.
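
To make that concrete, here is a rough sketch of a per-locale compare.
It assumes the un-prefixed spellings that POSIX.1-2008 later
standardized (newlocale, strcoll_l, freelocale, locale_t) rather than
the double-underscore glibc 2.2 names, and utf8_collate_in is a
made-up helper name, not anything in glib:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* assumption: POSIX.1-2008 locale API */
#endif
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: collate a and b under the named locale without
 * changing the process-wide locale, so no setlocale() save/restore.
 * Returns 0 and stores a strcoll-style result in *out on success,
 * or -1 if the locale object could not be created. */
static int
utf8_collate_in (const char *locale_name,
                 const char *a, const char *b, int *out)
{
  locale_t loc;

  assert (locale_name != NULL && a != NULL && b != NULL && out != NULL);

  loc = newlocale (LC_COLLATE_MASK, locale_name, (locale_t) 0);
  if (loc == (locale_t) 0)
    return -1;                 /* e.g. locale not installed */

  *out = strcoll_l (a, b, loc);
  freelocale (loc);
  return 0;
}
```

The caller would pass something like "de_DE.UTF-8" and treat a -1
return as "use the fallback". Creating the locale object on every call
is wasteful; a real version would presumably cache it.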

I'm not sure how you find an available UTF-8 locale matching the
current locale; many GNU libc systems will have pretty complete
coverage, but there is no guarantee of this. (I believe for Red Hat
7.1, we have complete UTF-8 locale coverage, but that's 18 megs of
data, and Debian, for instance, has a system where the user chooses
what locales to install.)
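
One guess at the name lookup, purely as a sketch: take the current
locale name in the usual lang_TERRITORY.codeset@modifier form and swap
the codeset for UTF-8. The helper name utf8_locale_candidate is made
up, keeping the @modifier is an assumption, and the result is only a
candidate that still has to be checked against what is installed:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: rewrite "lang_TERR[.codeset][@mod]" as
 * "lang_TERR.UTF-8[@mod]" into buf. Returns 0 on success, -1 if the
 * name is "C"/"POSIX"/empty or the buffer is too small. */
static int
utf8_locale_candidate (const char *name, char *buf, size_t buflen)
{
  size_t base_len;
  const char *mod;
  int n;

  assert (name != NULL && buf != NULL);

  base_len = strcspn (name, ".@");   /* length of the "lang_TERR" part */
  mod = strchr (name, '@');          /* optional "@modifier", or NULL */

  if (base_len == 0 || strcmp (name, "C") == 0 || strcmp (name, "POSIX") == 0)
    return -1;

  n = snprintf (buf, buflen, "%.*s.UTF-8%s",
                (int) base_len, name, mod != NULL ? mod : "");
  return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

The input would come from setlocale (LC_COLLATE, NULL). Whether the
resulting name actually exists on a given system is exactly the part
there is no guarantee about.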

Worse is that this is, of course, completely non-portable to
other operating systems so we'd have to have a fallback, hopefully
better than #define g_utf8_strcoll g_utf8_strcmp.
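
For reference, this is roughly what that lazy fallback amounts to. The
sketch uses plain strcmp as a stand-in (utf8_codepoint_cmp is a made-up
name); byte-wise comparison of valid UTF-8 gives code point order,
which is stable but not what any user would call alphabetical:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical fallback: compare two UTF-8 strings byte-wise.
 * For valid UTF-8 this is the same as ordering by code point.
 * Returns -1, 0 or 1. */
static int
utf8_codepoint_cmp (const char *a, const char *b)
{
  int r;

  assert (a != NULL && b != NULL);

  r = strcmp (a, b);
  return (r > 0) - (r < 0);
}
```

With this, "Zebra" sorts before "apple" (uppercase ASCII comes first)
and "\xc3\xa9" (e with acute accent) sorts after "z", which is the kind
of result a locale-aware strcoll is supposed to avoid.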

Regards,
                                        Owen
