UTF-8: Collation
- From: Owen Taylor <otaylor redhat com>
- To: gtk-devel-list gnome org
- Cc: gtk-i18n-list gnome org, trow ximian com, darin bentspoon com
- Subject: UTF-8: Collation
- Date: 27 Jun 2001 12:29:10 -0400
Collation: [ http://bugzilla.gnome.org/show_bug.cgi?id=55836 ]
Ordering strings in a linguistically meaningful fashion requires more
work than simply ordering by codepoints. In POSIX, the distinction
here is between strcmp() and strcoll().
A standard algorithm for collation is described in Unicode Technical
Standard #10: Unicode Collation Algorithm
http://www.unicode.org/unicode/reports/tr10/
Closely related to the UCA is ISO 14651:
http://anubis.dkuug.dk/JTC1/SC22/WG20/docs/projects#14651
When collating strings, the options that may be interesting to control
are:
a) Locale. Collation is fairly strongly locale dependent, with locale
differences including treating multiple characters as a single
character for sorting (or vice versa), and the treatment of accents.
b) What sort of differences should be counted in the comparison. Most
collation algorithms have a number of separate levels of
comparison, a typical set being:
Level 1) Differences in base characters: 'e' vs. 'f'
Level 2) Differences in accents: 'e acute' vs. 'e'
Level 3) Differences in case
Comparison taking into account differences only at the first two
levels is roughly equivalent to using strcasecmp. The number of levels
used is sometimes (e.g. in Java) called the "strength" of the
comparison.
c) Normalization to perform before the main part of the collation
algorithm.
When sorting a set of strings, it may be more efficient to form sort
keys for each string that can be sorted by strcmp() rather than using
the full collation process for each string; the POSIX function
for this process is strxfrm.
The ICU and Java APIs here are quite similar - there is a collator
object, which is locale-specific, and has the strength and
normalization mode set on it as options.
We can't expect to implement something this complicated for GLib-2.0
and get it close to right:
* We have no API currently to represent a "locale" in GLib independent
of the C library's locale.
* The code is complicated, and there is a lot of data to accompany
it. About the only way to get things working quickly would be to
add an existing library - probably ICU - as a dependency. But ICU
is very big and complicated to add as a mandatory dependency.
So, I think the right thing to do is to go very simple for GLib-2.0 -
have a g_utf8_collate() function that is specified to collate for the
current locale, using all differences, and perhaps some fixed
normalization, then worry about going with a fancier method later.
The bug referenced above has some ideas about implementing
g_utf8_collate() in terms of the C library's collation functions. How
well can we do in terms of the C library? Well, e.g. GNU libc has an
implementation of ISO 14651 that should produce almost identical
results to the Unicode algorithm for precomposed characters. It
doesn't, however, handle combining characters, so sorting on those is
going to be iffy - we can get around that to some extent by
normalizing to Unicode normalization form C before calling strcoll().
Regards,
Owen