Re: normalized strings in searches, completion, etc.



On 3/23/07, Sven Neumann <sven gimp org> wrote:
> Hi,
>
> On Fri, 2007-03-23 at 03:25 +0100, Denis Jacquerye wrote:
>
> > I'm sure there are tons of places where this doesn't work and some
> > where it does. But it should work everywhere someone does a search or
> > compares strings, except in some specific cases. What's the best way of
> > tackling the issue?
>
> It should work if all places where strings are compared used
> g_utf8_collate(). I am surprised that this doesn't seem to be the case.
> Perhaps it's an issue that is often overlooked, as many developers are
> not aware of the pitfalls of working with Unicode text.

g_utf8_collate() uses G_NORMALIZE_ALL_COMPOSE = G_NORMALIZE_NFKC, so it
will treat ² and 2 as equivalent. Should that be the default for all
searches?

Which is better? Using g_utf8_collate() instead of strcmp() or a
combination of g_utf8_normalize() and then strcmp()?
If g_utf8_normalize() is used, which normalization should be used?
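
To make the normalize-then-compare option concrete, here is a minimal sketch. It is written in Python rather than C, with the standard unicodedata module standing in for g_utf8_normalize(); the helper name normalized_equal is made up for this example. The "NFC" and "NFKC" forms are assumed to correspond to G_NORMALIZE_NFC and G_NORMALIZE_NFKC.

```python
import unicodedata


def normalized_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the same Unicode normalization form.

    Hypothetical helper for illustration; not part of GLib or GTK.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                   # False: raw comparison
print(normalized_equal(precomposed, decomposed))   # True: canonically equivalent
print(normalized_equal("\u00b2", "2", "NFC"))      # False: NFC keeps ² distinct
print(normalized_equal("\u00b2", "2", "NFKC"))     # True: NFKC folds ² into 2
```

The choice of form is the whole question here: the comparison itself stays a plain equality test, and only the normalization step decides whether ² and 2 match.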

I'm now guessing it should be G_NORMALIZE_NFC =
G_NORMALIZE_DEFAULT_COMPOSE in most cases, because this will match
canonically equivalent strings (e.g. precomposed é and e followed by a
combining acute accent compare equal) but not compatibility equivalents
(e.g. ² and 2 stay different). It also avoids partial matches such as
"Bise" against a decomposed "Bisé", where the combining diacritic comes
at the end of the string.
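
The partial-match case can be sketched the same way. This again uses Python's unicodedata as a stand-in for g_utf8_normalize() with G_NORMALIZE_NFC, and starts_with_normalized is a hypothetical helper name, not an existing API.

```python
import unicodedata


def starts_with_normalized(text: str, prefix: str, form: str = "NFC") -> bool:
    """Prefix test performed on normalized copies of both strings.

    Hypothetical helper for illustration; not part of GLib or GTK.
    """
    return unicodedata.normalize(form, text).startswith(
        unicodedata.normalize(form, prefix)
    )


decomposed = "Bise\u0301"   # "Bisé" stored with a trailing combining accent

print(decomposed.startswith("Bise"))                     # True: raw prefix match
print(starts_with_normalized(decomposed, "Bise"))        # False: é is one unit after NFC
print(starts_with_normalized(decomposed, "Bis\u00e9"))   # True: precomposed prefix matches
```

Without normalization, the raw prefix test splits the base letter from its accent; composing first keeps them together.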

I'm also guessing g_utf8_collate() is more appropriate for sorting
than for searching.

Cheers,

Denis Moyogo Jacquerye

