UTF-8: Collation



Collation: [ http://bugzilla.gnome.org/show_bug.cgi?id=55836 ]

Ordering strings in a linguistically meaningful fashion requires more
work than simply ordering by code points.  In POSIX, the distinction
here is between strcmp() and strcoll().
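:
For illustration, here is a minimal sketch of that difference:
strcmp() compares raw bytes, while strcoll() uses the current
locale's collation rules.  (The output comments assume a locale
along the lines of en_US.UTF-8.)

  #include <locale.h>
  #include <stdio.h>
  #include <string.h>

  int
  main (void)
  {
    /* Use the collation rules of the locale from the environment. */
    setlocale (LC_COLLATE, "");

    const char *a = "apple";
    const char *b = "Banana";

    /* Byte-wise: 'B' (0x42) sorts before 'a' (0x61). */
    printf ("strcmp:  %s first\n", strcmp (a, b) < 0 ? a : b);

    /* Locale-aware: most English locales put "apple" first. */
    printf ("strcoll: %s first\n", strcoll (a, b) < 0 ? a : b);

    return 0;
  }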

A standard algorithm for collation is described in Unicode Technical
Standard #10, the Unicode Collation Algorithm (UCA):

  http://www.unicode.org/unicode/reports/tr10/

Closely related to the UCA is ISO 14651:

  http://anubis.dkuug.dk/JTC1/SC22/WG20/docs/projects#14651

When collating strings, the options that may be interesting to control
are:

 a) Locale. Collation is fairly strongly locale-dependent, with locale
   differences including treating multiple characters as a single
   character for sorting (traditional Spanish treats 'ch' this way),
   or vice versa, and the treatment of accents.

 b) What sort of differences should be counted in the comparison. Most
   collation algorithms have a number of separate levels of
   comparison, a typical set being:

    Level 1) Differences in characters 'e' vs. 'f'
    Level 2) Differences in accents 'e acute' vs 'e'
    Level 3) Differences in case

   Comparison taking into account differences only at the first two
   levels is roughly equivalent to using strcasecmp().  The number of
   levels used is sometimes (e.g. in Java) called the "strength" of
   the comparison.

 c) Normalization to perform before the main part of the collation
   algorithm.

When sorting a set of strings, it may be more efficient to form a
sort key for each string that can be compared with strcmp(), rather
than running the full collation process for every pairwise
comparison; the POSIX function for this purpose is strxfrm().
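:
A minimal sketch of that approach (the sample strings are arbitrary,
and a UTF-8 locale in the environment is assumed):

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct item
  {
    const char *str;  /* original string */
    char       *key;  /* sort key produced by strxfrm() */
  };

  static int
  compare_items (const void *a, const void *b)
  {
    /* Sort keys compare correctly with plain strcmp(). */
    return strcmp (((const struct item *) a)->key,
                   ((const struct item *) b)->key);
  }

  int
  main (void)
  {
    struct item items[] = { { "pêche", NULL },
                            { "peach", NULL },
                            { "pèche", NULL } };
    size_t n = sizeof items / sizeof items[0];
    size_t i;

    setlocale (LC_COLLATE, "");

    for (i = 0; i < n; i++)
      {
        /* A zero-sized first call returns the length needed for the
           key, excluding the terminating NUL. */
        size_t len = strxfrm (NULL, items[i].str, 0) + 1;
        items[i].key = malloc (len);
        strxfrm (items[i].key, items[i].str, len);
      }

    qsort (items, n, sizeof items[0], compare_items);

    for (i = 0; i < n; i++)
      {
        printf ("%s\n", items[i].str);
        free (items[i].key);
      }
    return 0;
  }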


The ICU and Java APIs here are quite similar - there is a collator
object, which is locale-specific and has the strength and
normalization mode set on it as options.
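:
As a sketch of that pattern using ICU's C API (assuming a reasonably
recent ICU; the locale name and sample strings are arbitrary):

  #include <stdio.h>
  #include <unicode/ucol.h>

  int
  main (void)
  {
    UErrorCode status = U_ZERO_ERROR;

    /* A collator is opened for a specific locale. */
    UCollator *coll = ucol_open ("fr_FR", &status);
    if (U_FAILURE (status))
      return 1;

    /* Strength: UCOL_SECONDARY considers only levels 1 and 2,
       i.e. base characters and accents, but not case. */
    ucol_setStrength (coll, UCOL_SECONDARY);

    /* Ask the collator to normalize its input first. */
    ucol_setAttribute (coll, UCOL_NORMALIZATION_MODE, UCOL_ON, &status);

    /* At secondary strength these differ only in case, so they
       compare equal. */
    if (ucol_strcollUTF8 (coll, "côte", -1, "CÔTE", -1, &status)
        == UCOL_EQUAL)
      printf ("equal\n");

    ucol_close (coll);
    return 0;
  }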

We can't expect to implement something this complicated for GLib-2.0
and get it close to right:

 * We currently have no API to represent a "locale" in GLib
   independent of the C library's locale.

 * The code is complicated, and there is a lot of data to accompany
   it.  About the only way to get things working quickly would be to
   add an existing library - probably ICU - as a dependency. But ICU
   is very big and complicated to add as a mandatory dependency.


So, I think the right thing to do is to go very simple for GLib-2.0 -
have a g_utf8_collate() function that is specified to collate for the
current locale, using all differences, and perhaps some fixed
normalization, then worry about a fancier method later.
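:
Hypothetical usage of such a function (the exact signature is an
assumption, since the function is only proposed here):

  #include <locale.h>
  #include <glib.h>

  static gint
  collate_cb (gconstpointer a, gconstpointer b)
  {
    return g_utf8_collate ((const gchar *) a, (const gchar *) b);
  }

  int
  main (void)
  {
    GList *names = NULL, *l;

    /* g_utf8_collate() is specified against the current locale. */
    setlocale (LC_ALL, "");

    names = g_list_prepend (names, (gpointer) "Zürich");
    names = g_list_prepend (names, (gpointer) "Oslo");
    names = g_list_prepend (names, (gpointer) "Ångström");

    names = g_list_sort (names, collate_cb);

    for (l = names; l != NULL; l = l->next)
      g_print ("%s\n", (const gchar *) l->data);

    g_list_free (names);
    return 0;
  }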

The bug referenced above has some ideas about implementing
g_utf8_collate() in terms of the C library's collation functions.  How
well can we do in terms of the C library? Well, e.g. GNU libc has an
implementation of ISO 14651 that should produce almost identical
results to the Unicode algorithm for precomposed characters.  It
doesn't, however, handle combining characters, so sorting on those is
going to be iffy - we can get around that to some extent by
normalizing to Unicode normalization form C before calling strcoll().
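:
A rough sketch of that idea - not a finished implementation, and with
error handling mostly elided - using GLib's normalization and
charset-conversion calls:

  #include <string.h>
  #include <glib.h>

  /* Hypothetical helper: collate two UTF-8 strings via the C
     library, normalizing to form C first. */
  static gint
  my_utf8_collate (const gchar *str1, const gchar *str2)
  {
    gint result;

    /* Compose combining sequences so precomposed and decomposed
       spellings of the same text collate identically. */
    gchar *norm1 = g_utf8_normalize (str1, -1, G_NORMALIZE_DEFAULT_COMPOSE);
    gchar *norm2 = g_utf8_normalize (str2, -1, G_NORMALIZE_DEFAULT_COMPOSE);

    /* Convert to the locale's encoding so strcoll() understands it. */
    gchar *loc1 = g_locale_from_utf8 (norm1, -1, NULL, NULL, NULL);
    gchar *loc2 = g_locale_from_utf8 (norm2, -1, NULL, NULL, NULL);

    if (loc1 != NULL && loc2 != NULL)
      result = strcoll (loc1, loc2);
    else
      result = strcmp (norm1, norm2);  /* conversion failed; fall
                                          back to byte order */

    g_free (loc1);
    g_free (loc2);
    g_free (norm1);
    g_free (norm2);

    return result;
  }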

Regards,
                                        Owen



