Re: UTF-8 Functions



Owen Taylor wrote:
> 
> I've now added the following functions to GLib. I'm pretty happy with
> them as encapsulating the basic operations of this nature in a simple
> manner.
> 
> The main change I'm considering at this point is to add a max_len (can
> be -1) parameter to normalize(), casefold(), strup(), strdown(), and
> collate_key(). It makes things just a little bit more complicated, but
> I hate having to g_strndup() a portion of a string, do something, and
> then free the dup'ed string immediately.
> 
> (If you look inside the implementations, you'll see that this is
> a convenience concern, not an efficiency concern at this point!)
> 
> But if people have other easy-to-implement improvements, I'd be
> happy to consider them as well.
> 
> Regards,
>                                         Owen
> 
> /**
>  * g_utf8_normalize:
>  * @str: a UTF-8 encoded string.
>  * @mode: the type of normalization to perform.
>  *
>  * Convert a string into canonical form, standardizing
>  * such issues as whether a character with an accent
>  * is represented as a base character and combining
>  * accent or as a single precomposed character. You
>  * should generally call g_utf8_normalize before
>  * comparing two Unicode strings.
>  *
>  * The normalization mode %G_NORMALIZE_DEFAULT only
>  * standardizes differences that do not affect the
>  * text content, such as the above-mentioned accent
>  * representation. %G_NORMALIZE_ALL also standardizes
>  * the "compatibility" characters in Unicode, such
>  * as SUPERSCRIPT THREE to the standard forms
>  * (in this case DIGIT THREE). Formatting information
>  * may be lost but for most text operations such
>  * characters should be considered the same.
>  * For example, g_utf8_collate() normalizes
>  * with %G_NORMALIZE_ALL as its first step.
>  *
>  * %G_NORMALIZE_DEFAULT_COMPOSE and %G_NORMALIZE_ALL_COMPOSE
>  * are like %G_NORMALIZE_DEFAULT and %G_NORMALIZE_ALL,
>  * but return a result with composed forms rather
>  * than a maximally decomposed form. This is often
>  * useful if you intend to convert the string to
>  * a legacy encoding or pass it to a system with
>  * less capable Unicode handling.
>  *
>  * Return value: the string in normalized form
>  **/
> gchar *g_utf8_normalize (const gchar    *str,
>                          GNormalizeMode  mode);
> 
> /**
>  * g_utf8_strdown:
>  * @str: a UTF-8 encoded string
>  *
>  * Converts all Unicode characters in the string that have a case
>  * to lowercase. The exact manner in which this is done depends
>  * on the current locale, and may result in the number of
>  * characters in the string changing.
>  *
>  * Return value: a newly allocated string, with all characters
>  *    converted to lowercase.
>  **/
> gchar *g_utf8_strdown (const gchar *str);
> 
> /**
>  * g_utf8_strup:
>  * @str: a UTF-8 encoded string
>  *
>  * Converts all Unicode characters in the string that have a case
>  * to uppercase. The exact manner in which this is done depends
>  * on the current locale, and may result in the number of
>  * characters in the string increasing. (For instance, the
>  * German ess-zet will be changed to SS.)
>  *
>  * Return value: a newly allocated string, with all characters
>  *    converted to uppercase.
>  **/
> gchar *g_utf8_strup (const gchar *str);
> 
> /**
>  * g_utf8_casefold:
>  * @str: a UTF-8 encoded string
>  *
>  * Converts a string into a form that is independent of case. The
>  * result will not correspond to any particular case, but can be
>  * compared for equality or ordered with the results of calling
>  * g_utf8_casefold() on other strings.
>  *
>  * Note that calling g_utf8_casefold() followed by g_utf8_collate() is
>  * only an approximation to the correct linguistic case insensitive
>  * ordering, though it is a fairly good one. Getting this exactly
>  * right would require a more sophisticated collation function that
>  * takes case sensitivity into account. GLib does not currently
>  * provide such a function.
>  *
>  * Return value: a newly allocated string, that is a
>  *   case independent form of @str.
>  **/
> gchar *g_utf8_casefold (const gchar *str);
> 
> /**
>  * g_utf8_collate:
>  * @str1: a UTF-8 encoded string
>  * @str2: a UTF-8 encoded string
>  *
>  * Compares two strings for ordering using the linguistically
>  * correct rules for the current locale. When sorting a large
>  * number of strings, it will be significantly faster to
>  * obtain collation keys with g_utf8_collate_key() and
>  * compare the keys with strcmp() when sorting instead of
>  * sorting the original strings.
>  *
>  * Return value: -1 if str1 compares before str2, 0 if they
>  *   compare equal, 1 if str1 compares after str2.
>  **/
> gint g_utf8_collate (const gchar *str1, const gchar *str2);
> 
> /**
>  * g_utf8_collate_key:
>  * @str: a UTF-8 encoded string.
>  *
>  * Converts a string into a collation key that can be compared
>  * with other collation keys using strcmp(). The results of
>  * comparing the collation keys of two strings with strcmp()
>  * will always be the same as comparing the two original
>  * strings with g_utf8_collate().
>  *
>  * Return value: a newly allocated string. This string should
>  *   be freed with g_free when you are done with it.
>  **/
> gchar *g_utf8_collate_key (const gchar *str);

Hi Owen,

Would it be practical to add an extra parameter to the collation-related
calls to choose between alternative collation sequences? (The parameter
itself is easy enough; the question is whether all the data needed
behind it is available.) In East Asian languages many collation
sequences exist: phonetic, stroke, radical, etc. It doesn't seem very
practical to switch these by locale, as you often want to switch on the
fly when someone hits the "sort by radical/phonetics/strokes" button.
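
As a rough sketch of what such a parameter might look like, assuming a
mode argument alongside the two strings: everything below is
hypothetical. SketchCollateMode, sketch_collate_with_mode() and
sketch_sort() are illustrative names, not GLib functions, and the
three-entry table is toy data standing in for the real per-sequence
collation tables that the question above is about. It is plain C with
no GLib dependency so that it compiles on its own.

```c
#include <stddef.h>
#include <string.h>

/* All names here are hypothetical; none of this is GLib API. */
typedef enum {
  SKETCH_COLLATE_PHONETIC,  /* order by reading */
  SKETCH_COLLATE_STROKE     /* order by stroke count */
} SketchCollateMode;

/* Toy data standing in for real per-sequence collation tables. */
typedef struct {
  const char *text;     /* UTF-8 encoded string */
  const char *reading;  /* pinyin reading, tones omitted */
  int         strokes;  /* total stroke count */
} SketchEntry;

static const SketchEntry sketch_table[] = {
  { "大", "da",   3 },
  { "人", "ren",  2 },
  { "水", "shui", 4 }
};

/* Return the table entry for str, or NULL if the toy table lacks it. */
static const SketchEntry *
sketch_lookup (const char *str)
{
  size_t i;

  for (i = 0; i < sizeof sketch_table / sizeof sketch_table[0]; i++)
    if (strcmp (sketch_table[i].text, str) == 0)
      return &sketch_table[i];
  return NULL;
}

/* Clamp a comparison result to -1, 0 or 1. */
static int
sketch_sign (int value)
{
  return (value > 0) - (value < 0);
}

/* Compare two strings under the collation sequence chosen by mode. */
int
sketch_collate_with_mode (const char        *str1,
                          const char        *str2,
                          SketchCollateMode  mode)
{
  const SketchEntry *a = sketch_lookup (str1);
  const SketchEntry *b = sketch_lookup (str2);

  /* Strings missing from the toy table fall back to byte order. */
  if (a == NULL || b == NULL)
    return sketch_sign (strcmp (str1, str2));

  if (mode == SKETCH_COLLATE_STROKE && a->strokes != b->strokes)
    return sketch_sign (a->strokes - b->strokes);

  /* Phonetic mode, and the tie-break for equal stroke counts. */
  return sketch_sign (strcmp (a->reading, b->reading));
}

/* Insertion sort, so the mode can be passed without a global. */
void
sketch_sort (const char **strings, size_t n, SketchCollateMode mode)
{
  size_t i, j;

  for (i = 1; i < n; i++)
    {
      const char *cur = strings[i];

      for (j = i; j > 0 && sketch_collate_with_mode (strings[j - 1], cur, mode) > 0; j--)
        strings[j] = strings[j - 1];
      strings[j] = cur;
    }
}
```

The point of the sketch is only that the caller picks the sequence per
call: the same array can be re-sorted with SKETCH_COLLATE_PHONETIC or
SKETCH_COLLATE_STROKE when the user presses a different sort button,
without touching the locale.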

Regards,
Steve



