UTF-8: Case mapping
- From: Owen Taylor <otaylor redhat com>
- To: gtk-devel-list gnome org
- Cc: gtk-i18n-list gnome org, trow ximian com, darin bentspoon com
- Subject: UTF-8: Case mapping
- Date: 27 Jun 2001 13:33:03 -0400
Case conversion: [ http://bugzilla.gnome.org/show_bug.cgi?id=55852 ]
Issues in case mapping and case folding are described
in Unicode Technical Report #21: Case mapping.
 
  http://www.unicode.org/unicode/reports/tr21/
Some of the less obvious attributes of case mapping:
 * Case mapping is locale sensitive, though the total number of
   locale-sensitive rules is quite small. (The most important one:
   in Turkish, I is paired with dotless i, and i is paired with
   a capital I with a dot.)
 * Case mapping is context-sensitive; for instance, the proper
   lowercase equivalent of the Greek sigma depends on whether
   the letter occurs at the end of a word or elsewhere in it.
 * Case mapping can't be done character by character - for 
   instance, German ß maps to SS in uppercase.
 * Converting to a fixed case is a poor way to do caseless 
   comparison; properly, it should be done using the Unicode
   collation algorithm, ignoring case variants, but as an
   approximation, it is possible to use a set of "case 
   folding" rules.
 
   Except for dotted i, doing it this way removes all locale 
   sensitivity - to get around the problem of dotted 
   i, there are two techniques:
    - skip case mapping on i and dotted i altogether
    - map all i and dotted i together
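
To make two of the points above concrete (the Turkish i pairing and
ß growing to SS), here is a minimal sketch in C. The name
sketch_toupper, its turkish flag and its buffer interface are all
hypothetical, not a proposed API, and it only knows about ASCII,
U+00DF, U+0130 and U+0131; every other byte is copied unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: uppercase a UTF-8 string, handling only ASCII,
 * German sharp s (U+00DF) and the Turkish dotted/dotless i pair.
 * Returns the output length in bytes, or (size_t)-1 if 'out' is too small. */
size_t sketch_toupper(const char *in, int turkish, char *out, size_t cap)
{
    size_t len = 0;

    assert(in != NULL && out != NULL);
    if (cap == 0)
        return (size_t)-1;

    while (*in) {
        unsigned char c0 = (unsigned char)in[0];
        unsigned char c1 = (unsigned char)in[1]; /* at worst the NUL */
        const char *rep = in;   /* default: copy one byte unchanged */
        size_t rep_len = 1;     /* bytes written */
        size_t used = 1;        /* bytes consumed */
        char upper;

        if (c0 == 0xC3 && c1 == 0x9F) {          /* U+00DF: one char -> "SS" */
            rep = "SS"; rep_len = 2; used = 2;
        } else if (c0 == 0xC4 && c1 == 0xB1) {   /* dotless i (U+0131) -> I */
            rep = "I"; used = 2;
        } else if (c0 == 'i' && turkish) {       /* i -> I with dot (U+0130) */
            rep = "\xC4\xB0"; rep_len = 2;
        } else if (c0 >= 'a' && c0 <= 'z') {     /* plain ASCII letter */
            upper = (char)(c0 - 'a' + 'A');
            rep = &upper;
        }

        if (len + rep_len + 1 > cap)             /* keep room for the NUL */
            return (size_t)-1;
        memcpy(out + len, rep, rep_len);
        len += rep_len;
        in += used;
    }
    out[len] = '\0';
    return len;
}
```

Note that the output length cannot be computed from the character
count alone: the same "i" becomes one byte or two depending on the
flag, and ß turns one character into two.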
So, the abstract operations are:
 TOUPPER (string, locale)
 TOLOWER (string, locale)
 TOTITLE (string, locale)
 FOLD (string, dotted-i-method)
Since we don't have a method of representing locale in GLib
right now, I think we should start out with:
 g_utf8_toupper (string);  [ priority A ]
 g_utf8_tolower (string);  [ priority A ]
These would be defined to use the "current" locale, at a minimum.
We can add g_utf8_to_upper_with_locale (string, locale) later.
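
As a rough illustration of what using the "current" locale could
mean for the one rule that matters most, the sketch below asks the C
library for the LC_CTYPE locale name and checks its language code.
Both function names are hypothetical. Treating "az" (Azerbaijani)
like "tr" is my assumption that it follows the same dotted-i rule;
the text above only mentions Turkish.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: nonzero if a POSIX-style locale name such as
 * "tr_TR.UTF-8" has the language code "tr" or "az" (assumed to share
 * the Turkish dotted-i rule). */
int locale_name_uses_dotted_i(const char *name)
{
    assert(name != NULL);
    if (strncmp(name, "tr", 2) != 0 && strncmp(name, "az", 2) != 0)
        return 0;
    /* The two letters must be the whole language part of the name. */
    return name[2] == '\0' || name[2] == '_' ||
           name[2] == '.'  || name[2] == '@';
}

/* Hypothetical helper: apply the check to the current LC_CTYPE locale. */
int current_locale_uses_dotted_i(void)
{
    const char *name = setlocale(LC_CTYPE, NULL); /* query only */
    return name != NULL && locale_name_uses_dotted_i(name);
}
```

A program that never calls setlocale() stays in the "C" locale, so
the second function returns 0 there.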
It's not much work to add:
 g_utf8_totitle (string); [ priority C ]
Though I don't know any APIs that actually do this currently,
and title case only actually matters for some "compatibility"
characters in Unicode.
A case folding routine is probably also useful. I don't see
offering the choice of dotted-i-method as a good thing - 
I see no way a programmer would know what to pick. IMO,
we should simply pick one - probably the "merge all I's
together" method, and have:
 g_utf8_casefold (string); [ priority B ]
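
A minimal sketch of what folding with the "merge all I's together"
rule might look like follows. It is not the proposed implementation:
sketch_casefold and its buffer interface are hypothetical, and the
table is cut down to ASCII, the four i variants, ß and sigma, where a
real routine would be driven by the Unicode case folding data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: fold a UTF-8 string for caseless comparison.
 * I, i, dotted capital I (U+0130) and dotless i (U+0131) all become "i".
 * Returns the output length in bytes, or (size_t)-1 if 'out' is too small. */
size_t sketch_casefold(const char *in, char *out, size_t cap)
{
    size_t len = 0;

    assert(in != NULL && out != NULL);
    if (cap == 0)
        return (size_t)-1;

    while (*in) {
        unsigned char c0 = (unsigned char)in[0];
        unsigned char c1 = (unsigned char)in[1]; /* at worst the NUL */
        const char *rep = in;   /* default: copy one byte unchanged */
        size_t rep_len = 1;     /* bytes written */
        size_t used = 1;        /* bytes consumed */
        char lower;

        if (c0 == 0xC4 && (c1 == 0xB0 || c1 == 0xB1)) {
            rep = "i"; used = 2;                  /* U+0130, U+0131 -> i */
        } else if (c0 == 0xC3 && c1 == 0x9F) {
            rep = "ss"; rep_len = 2; used = 2;    /* U+00DF -> ss */
        } else if ((c0 == 0xCE && c1 == 0xA3) ||
                   (c0 == 0xCF && c1 == 0x82)) {
            rep = "\xCF\x83"; rep_len = 2; used = 2; /* both to U+03C3 */
        } else if (c0 >= 'A' && c0 <= 'Z') {
            lower = (char)(c0 - 'A' + 'a');       /* ASCII, including I -> i */
            rep = &lower;
        }

        if (len + rep_len + 1 > cap)              /* keep room for the NUL */
            return (size_t)-1;
        memcpy(out + len, rep, rep_len);
        len += rep_len;
        in += used;
    }
    out[len] = '\0';
    return len;
}
```

With this rule "STRASSE" and "straße" fold to the same bytes, and so
do all four i variants, with no locale argument needed.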
There is also the question of "fuzzy" comparison routines -
the equivalent of strcasecmp - we actually have three axes
on which we can ignore differences:
 * Normalization (none, canonical, compat)
 * Case (unfolded, folded)
 * dotted-i-folding method
I _don't_ think we should offer all these possibilities; not
having a sense yet of what the right choices are, I'm inclined
to leave out such fuzzy comparison routines and let people
build what they need out of the primitives.
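
To show what building a comparison out of the primitives could look
like, here is a small sketch. The names transform_fn,
compare_transformed and ascii_fold are hypothetical; ascii_fold is
only an ASCII stand-in for a real folding or normalization primitive,
which a caller would pass in its place.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A primitive takes a string and returns a newly allocated,
 * transformed copy (or NULL on allocation failure). */
typedef char *(*transform_fn)(const char *);

/* Stand-in primitive for this sketch: lowercases ASCII letters only. */
char *ascii_fold(const char *s)
{
    size_t n = strlen(s);
    char *out = malloc(n + 1);

    if (out == NULL)
        return NULL;
    for (size_t i = 0; i <= n; i++) {   /* <= n also copies the NUL */
        char c = s[i];
        out[i] = (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
    }
    return out;
}

/* Compare a and b after running both through the same primitive.
 * Returns <0, 0 or >0 like strcmp(); aborts if allocation fails. */
int compare_transformed(const char *a, const char *b, transform_fn transform)
{
    char *ta, *tb;
    int result;

    assert(a != NULL && b != NULL && transform != NULL);
    ta = transform(a);
    tb = transform(b);
    if (ta == NULL || tb == NULL)
        abort();
    result = strcmp(ta, tb);            /* bytewise, not collation order */
    free(ta);
    free(tb);
    return result;
}
```

A caller who also wants to ignore normalization differences would
pass a primitive that normalizes and then folds, so the comparison
routine itself never needs to know which axes are being ignored.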
Regards,
                                        Owen