Re: strcasecmp/tolower/toupper breakage

From: Havoc Pennington <hp redhat com>
To: George <jirka 5z com>
Cc: gnome-devel-list gnome org, gnome-hackers gnome org
Subject: Re: strcasecmp/tolower/toupper breakage
Date: 04 May 2001 00:06:21 -0400

George <jirka 5z com> writes: 
> I suggest we get ascii only versions into glib 2.  In fact I suggest
> g_strcasecmp and g_strncasecmp work as ascii only, since there doesn't seem
> to be any legitimate reason for use of a locale specific strcasecmp (again,
> strcoll should be used).
> 

It's far worse than you think - strcoll() compares strings in the
locale's encoding, so it doesn't work on UTF-8 unless the locale
itself is UTF-8. What's needed is a UTF-8 strcoll() implementation.

We punted this out of glib 2; it's really hard to implement. :-(

The cheesy way is to setlocale() to the current locale, convert the
strings to the locale encoding, compare them with strcoll(), then
restore the locale. But it's not thread safe and it's butt slow. So
not really acceptable.
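
To make the cheesy approach concrete, here is a rough sketch using only
the standard C library. The name collate_in_locale() is made up for
illustration, and the "convert to locale encoding" step is left out, so
it assumes both strings are already in the target locale's encoding.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Compare a and b with strcoll() under the named locale, then put the
 * previous LC_COLLATE setting back. Not thread safe: setlocale() changes
 * process-wide state. Assumes both strings are already in that locale's
 * encoding (the conversion step is omitted in this sketch). */
int collate_in_locale(const char *locale, const char *a, const char *b)
{
    assert(locale != NULL && a != NULL && b != NULL);

    /* A later setlocale() call may overwrite the returned string,
     * so keep a private copy of the old setting. */
    const char *current = setlocale(LC_COLLATE, NULL);
    char *saved = current ? malloc(strlen(current) + 1) : NULL;
    if (saved)
        strcpy(saved, current);

    int result;
    if (setlocale(LC_COLLATE, locale) != NULL)
        result = strcoll(a, b);
    else
        result = strcmp(a, b);   /* locale unavailable: plain byte order */

    if (saved) {
        setlocale(LC_COLLATE, saved);
        free(saved);
    }
    return result;
}
```

Passing "" as the locale name picks up the user's environment, which is
the "current locale" case above. Every call pays for two setlocale()
round trips, which is where the slowness comes from.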

I believe toupper, tolower, etc. just corrupt the hell out of UTF-8
strings: they work one byte at a time, so in a non-C locale they can
rewrite individual bytes of a multibyte sequence. All code using them
is flat-out broken, as in "causes segfaults", with GTK 2, unless you
know the text is ASCII only. g_strup(), g_strdown(), etc. are also
broken to use on Unicode since they use toupper and tolower.

There are g_unichar_toupper(), etc. in glib 2 which should be used
instead.
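
For the ASCII-only variants George proposes, something along these
lines would probably do. This is only a sketch with made-up names
(ascii_tolower_byte, ascii_strcasecmp, ascii_strdown), not glib code.
The point is that it only ever touches 'A'..'Z', so the bytes >= 0x80
that make up multibyte UTF-8 sequences pass through unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Lowercase only 'A'..'Z'; every other byte, including the >= 0x80 bytes
 * that make up multibyte UTF-8 sequences, is returned unchanged. */
unsigned char ascii_tolower_byte(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* Locale-independent, ASCII-only case-insensitive comparison. */
int ascii_strcasecmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;; a++, b++) {
        unsigned char ca = ascii_tolower_byte((unsigned char)*a);
        unsigned char cb = ascii_tolower_byte((unsigned char)*b);
        if (ca != cb || ca == '\0')
            return (int)ca - (int)cb;
    }
}

/* Lowercase the ASCII letters of s in place. */
void ascii_strdown(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        *s = (char)ascii_tolower_byte((unsigned char)*s);
}
```

This never corrupts UTF-8, but it also never folds non-ASCII letters,
so it is only the right tool for things like protocol keywords and
config keys.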

A utf8_strcasecmp() is pretty easy to implement using
g_unichar_tolower(), if you don't change its behavior according to
locale.
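
Roughly what I have in mind, as a self-contained sketch rather than
real glib code. utf8_next() and toy_tolower() are stand-ins invented
for this example: in glib 2 you would decode with g_utf8_get_char() and
g_utf8_next_char() and fold with g_unichar_tolower(). The toy folder
only knows ASCII and the Latin-1 capitals, and the decoder assumes
valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from valid UTF-8 at *p and advance *p past it.
 * No validation is done; malformed input gives meaningless results. */
uint32_t utf8_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t c = s[0];
    int extra = 0;

    if (c >= 0xF0)      { c &= 0x07; extra = 3; }
    else if (c >= 0xE0) { c &= 0x0F; extra = 2; }
    else if (c >= 0xC0) { c &= 0x1F; extra = 1; }
    s++;
    /* Stop early at a NUL so a truncated sequence cannot overrun. */
    while (extra-- > 0 && *s != '\0')
        c = (c << 6) | (uint32_t)(*s++ & 0x3F);
    *p = s;
    return c;
}

/* Toy stand-in for g_unichar_tolower(): ASCII plus Latin-1 capitals. */
uint32_t toy_tolower(uint32_t c)
{
    if (c >= 'A' && c <= 'Z')
        return c + 0x20;
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)  /* U+00D7 is the times sign */
        return c + 0x20;
    return c;
}

/* Compare two UTF-8 strings code point by code point after lowercasing
 * each one with tolower_fn. Locale independent by construction. */
int utf8_strcasecmp(const char *a, const char *b,
                    uint32_t (*tolower_fn)(uint32_t))
{
    assert(a != NULL && b != NULL && tolower_fn != NULL);
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;;) {
        uint32_t ca = tolower_fn(utf8_next(&pa));
        uint32_t cb = tolower_fn(utf8_next(&pb));
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;   /* both strings ended together */
    }
}
```

The result orders by code point value, so it is fine for equality tests
but says nothing useful about how a human would sort the strings.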

It might be useful to write either a source code scanner or an
nm-based script to find suspicious locale-dependent code, either by
looking for dependencies on locale-specific C library symbols in the
binary or looking for suspect functions in the source code. Sort of
"i18n-lint." Could also find uses of GdkFont, etc.
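
The source-scanner half could start out as simple as the sketch below.
The function name i18n_lint_line() and the watch list are made up for
illustration (the list is just the names mentioned in this thread), and
it makes no attempt to skip comments or string literals.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical watch list, taken from the names mentioned in this thread. */
static const char *const suspects[] = {
    "strcasecmp", "strncasecmp", "tolower", "toupper",
    "g_strup", "g_strdown", "g_strcasecmp", "g_strncasecmp",
    "GdkFont", NULL
};

static int is_ident_char(char c)
{
    return c == '_' || isalnum((unsigned char)c);
}

/* Return the first watch-list name (by position in the line) that
 * appears as a whole identifier, or NULL if the line looks clean. */
const char *i18n_lint_line(const char *line)
{
    assert(line != NULL);
    const char *p = line;

    while (*p != '\0') {
        if (!is_ident_char(*p)) {
            p++;
            continue;
        }
        /* Consume one whole identifier so that g_unichar_tolower
         * is not reported as a use of tolower. */
        const char *start = p;
        while (is_ident_char(*p))
            p++;
        size_t len = (size_t)(p - start);

        for (size_t i = 0; suspects[i] != NULL; i++) {
            if (strlen(suspects[i]) == len &&
                memcmp(start, suspects[i], len) == 0)
                return suspects[i];
        }
    }
    return NULL;
}
```

Feeding it each line from fgets() and printing file name, line number
and the returned name would already give a usable report.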

Havoc

Follow-Ups:
- Re: strcasecmp/tolower/toupper breakage
  - From: Christopher James Lahey
- Re: strcasecmp/tolower/toupper breakage
  - From: Alan Cox

References:
- strcasecmp/tolower/toupper breakage
  - From: George
