Re: g_utf8_collate case sensitivity



Darin Adler <darin bentspoon com> writes:

> on 9/5/01 10:52 AM, Owen Taylor at otaylor redhat com wrote:
> 
> >> g_utf8_collate sorts ASCIIbetically case-sensitively (eg, 'Z' < 'a').
> >> That's a bug, right? (The docs say "Compares two strings for ordering
> >> using the linguistically correct rules for the current locale". I think
> >> the rules for my locale say that "bar" sorts before "Foo".)
> > 
> > The rules for the C locale generally have strcmp() ordering, I think.
> > 
> > g_utf8_collate() is just implemented in terms of strcoll() currently.
> 
> For this very reason, eel_strcoll uses strcoll outside "C" and "POSIX"
> locales, but uses eel_strcmp_case_breaks_ties in the "C" and "POSIX"
> locales.
> 
> It looks like we're still going to need an eel_strcoll (although we'll
> switch to g_utf8_collate and a UTF-8 version of
> eel_strcmp_case_breaks_ties).
> 
> Frankly, I'd strongly suggest providing a function with these kinds of
> semantics in glib -- it was my mistake not to bring this up when my remarks
> suggested g_utf8_collate.

Having a:

 g_utf8_collate_and_fallback_for_c_locale() 

is clearly wrong. g_utf8_collate() should always do what a
user would expect for a human-readable collation.

So, if the right thing to do for the C locale is to not use
strcoll() because strcoll() is broken, well, then perhaps
that should be in the implementation.
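
As a rough illustration of what "in the implementation" could mean,
here is a minimal sketch in plain C. It is not the GLib code and not
the eel code: sketch_collate() and sketch_strcmp_case_breaks_ties()
are hypothetical names, the tie-breaking comparison handles ASCII
only (a real version would need a UTF-8 aware case fold), and the
test for the "C"/"POSIX" locale by name is an assumption about how
one might detect that case.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: compare ignoring case; if the strings are equal
 * ignoring case, let the case-sensitive strcmp() order break the tie.
 * ASCII only: tolower() is applied byte by byte. */
int sketch_strcmp_case_breaks_ties(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    for (; *p != '\0' && *q != '\0'; p++, q++) {
        int ca = tolower(*p);
        int cb = tolower(*q);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    if (*p != *q)                 /* one string is a prefix of the other */
        return *p == '\0' ? -1 : 1;

    int tie = strcmp(a, b);       /* equal ignoring case: case decides */
    return (tie > 0) - (tie < 0);
}

/* Hypothetical collation entry point: use strcoll() in a real locale,
 * fall back to the helper above when LC_COLLATE is "C" or "POSIX". */
int sketch_collate(const char *a, const char *b)
{
    const char *loc = setlocale(LC_COLLATE, NULL);  /* query only */

    if (loc == NULL || strcmp(loc, "C") == 0 || strcmp(loc, "POSIX") == 0)
        return sketch_strcmp_case_breaks_ties(a, b);
    return strcoll(a, b);
}
```

With this sketch, a program that never calls setlocale() gets "bar"
before "Foo", while strcmp() alone puts "Foo" first. Callers see a
single function either way, which is the point: the fallback is an
implementation detail, not a second interface.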

There is no guarantee at all that g_utf8_collate() produces
the same sort order as strcoll() - the implementation in terms
of strcoll() is just what we are doing currently.

I feel a bit uncomfortable second-guessing strcoll() because:

 - maybe strcoll() in the C locale is implemented to do
   something smarter than strcmp().

 - g_utf8_casefold() isn't exactly speedy.
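
The first worry can at least be probed on a given system. The sketch
below is only an illustration with hypothetical names
(sketch_sign(), sketch_strcoll_matches_strcmp()): it reports whether
strcoll() and strcmp() order one pair of strings the same way under
whatever LC_COLLATE locale is currently active. Agreement on a few
pairs proves nothing in general, but a disagreement in the C locale
would show that strcoll() is doing something other than strcmp()
there.

```c
#include <assert.h>
#include <string.h>

/* Reduce a comparison result to -1, 0, or 1. */
int sketch_sign(int v)
{
    return (v > 0) - (v < 0);
}

/* Returns 1 if strcoll() and strcmp() agree on the order of a and b
 * in the current LC_COLLATE locale, 0 otherwise. */
int sketch_strcoll_matches_strcmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return sketch_sign(strcoll(a, b)) == sketch_sign(strcmp(a, b));
}
```

A program that has not called setlocale() runs in the C locale, so
calling sketch_strcoll_matches_strcmp("Z", "a") there checks exactly
the case from the original report.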

But it's certainly a question of how g_utf8_collate() is
implemented, not a question of missing additional interfaces.

Regards,
                                        Owen  

[ And yes, if you run in the C locale, you are almost certainly
  running a broken system. Most systems use ASCII as the
  character set for the C locale. You might question whether you
  want bytes above 127 to be UTF-8 or ISO-8859-1, but you most
  likely _don't_ want them to be invalid.

  Red Hat switched over to always installing a system default
  of some real locale sometime in the 6.x series ... maybe
  6.1. ]




