Re: UTF-8 Functions
- From: Owen Taylor <otaylor redhat com>
- To: Darin Adler <darin bentspoon com>
- Cc: gtk-devel-list gnome org, gtk-i18n-list gnome org, trow ximian com
- Subject: Re: UTF-8 Functions
- Date: 02 Jul 2001 19:18:43 -0400
Darin Adler <darin bentspoon com> writes:
> On Sunday, July 1, 2001, at 06:17 PM, Owen Taylor wrote:
>
> > But if people have other easy-to-implement improvements, I'd be
> > happy to consider them as well.
>
> My first thought is that if we plan to some day implement a more
> sophisticated collation algorithm, it might be good to have a single
> function that combines g_utf8_casefold and g_utf8_collate_key. It
> might be nice if the fact that these are two separate operations today
> doesn't prevent us from doing the casefolded collation efficiently
> later.
Well, remember, the most general interface is something like
what Java provides:

  typedef enum {
    G_COLLATE_PRIMARY,    /* Accents */
    G_COLLATE_SECONDARY,  /* Case */
    G_COLLATE_TERTIARY,
    G_COLLATE_IDENTICAL
  } GCollateStrength;

  g_utf8_collate_key_extended (string, strength, normalization_mode);

But we'd only be able to meaningfully implement a very small
subset of that currently.
g_utf8_collate_key () corresponds roughly to

  g_utf8_collate_key_extended (string,
                               G_COLLATE_TERTIARY,
                               G_NORMALIZE_ALL_COMPOSE);

And may be reimplemented as something like that in the future. The
question, I guess, is whether it is worth adding
g_utf8_collate_key_casefold (), which is currently:

  g_utf8_collate_key (g_utf8_casefold (string));

But might eventually be implemented as:

  g_utf8_collate_key_extended (string,
                               G_COLLATE_SECONDARY,
                               G_NORMALIZE_ALL_COMPOSE);

[ There are issues of correctness here as well as efficiency ]
It's certainly easy enough to do ... just a few lines of code. My
main hesitation is whether we know yet whether that is the right part
of the parameter space to give a special name.
Enough thinking out loud... I'll give it some consideration.
> Also, just out of curiosity, I'd like to understand if
> g_utf8_collate_key provides any guarantee about how it will work with
> strings and various normalizations of the same string. Will a
> normalized string collate == the same string before it was normalized?
> For which flavors of normalization?
The two collation functions both perform normalization with
G_NORMALIZE_ALL_COMPOSE as the first step. G_NORMALIZE_ALL_COMPOSE
is Unicode NFKC - compatibility decomposition followed by
canonical composition. Since:

  NFKC(NF<X>(c)) == NFKC(c)

For all normalization forms NF<X>, this means that normalization
before collation has no effect on collation order.
Regards,
Owen