Re: glib utf8 api

From: Owen Taylor <otaylor redhat com>
To: Gregory Sharp <gregsharp geo yahoo com>
Cc: gtk-devel-list gnome org
Subject: Re: glib utf8 api
Date: Mon, 03 Mar 2008 10:38:49 -0500

On Sun, 2008-03-02 at 14:49 -0800, Gregory Sharp wrote:
> Hi, I'm new to glib, and have questions/comments about
> the utf-8 API.
> 
> 1) There seems to be no good way to strncpy a utf8 string 
> into a fixed buffer.  g_strncpy doesn't work, because the 
> last character can get truncated causing an invalid string.  
> g_utf8_strncpy doesn't work either, because I don't know 
> how many characters fit in the buffer.

Doesn't strike me as a useful operation. It's easy enough to write
yourself with g_utf8_get_char()/g_utf8_next_char()/g_unichar_to_utf8().
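
As a rough illustration of the "write it yourself" route, here is a minimal sketch in plain C (no GLib, so it compiles on its own). The names utf8_seq_len() and utf8_copy_truncate() are hypothetical, not GLib API, and the sketch assumes the source string has already been validated as UTF-8. It copies only as many whole characters as fit, so the result is never cut in the middle of a multi-byte sequence.

```c
#include <stddef.h>
#include <string.h>

/* Byte length of the UTF-8 sequence that starts with lead byte c.
 * Assumes validated input; anything unexpected is treated as 1 byte. */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical helper: copy as many whole characters of src as fit in
 * dest, which holds dest_size bytes including the terminating NUL.
 * Returns the number of bytes copied, not counting the NUL. */
size_t utf8_copy_truncate(char *dest, size_t dest_size, const char *src)
{
    size_t src_len = strlen(src);
    size_t n = 0;

    if (dest_size == 0)
        return 0;

    while (n < src_len) {
        size_t len = utf8_seq_len((unsigned char)src[n]);
        /* Stop if the next character would overrun either buffer. */
        if (len > src_len - n || len > dest_size - 1 - n)
            break;
        n += len;
    }
    memcpy(dest, src, n);
    dest[n] = '\0';
    return n;
}
```

With a 4-byte buffer and the 6-byte input "a", U+00E9, U+20AC, this keeps the first two characters (3 bytes) and drops the third rather than splitting it.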

> 2) There seems to be no way to create a "best guess" valid
> string.  g_utf8_validate is nice and all, but if validation 
> fails I still need to create a valid string.  Am I supposed 
> to use g_convert_with_fallback() from UTF-8 to UTF-8?

No, g_convert() needs input in the character set you specify.
The fallback is for characters not in the output character
set.

There are lots of different things you might want a
"force to valid" function to do:

 - Try to guess the real encoding
 - Drop invalid sequences
 - Replace invalid sequences with replacement characters or ?
 - Replace invalid sequences with hex escapes
   (The GLib logging functions do this)

I guess I could see a point for including some function along these
lines in GLib, though it's not too hard to write your own.
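
One of the options listed above, replacing invalid bytes with ?, might look roughly like the following. This is only a sketch in plain C, not GLib code; utf8_valid_len() and utf8_make_valid_inplace() are hypothetical names. It replaces byte for byte, so the string can be fixed in place without allocating.

```c
#include <stddef.h>

/* Returns the length (1-4) of a well-formed UTF-8 sequence at s, given
 * `avail` readable bytes, or 0 if the bytes there are not valid UTF-8. */
static size_t utf8_valid_len(const unsigned char *s, size_t avail)
{
    size_t len, i;
    unsigned long cp, min;

    if (avail == 0) return 0;
    if (s[0] < 0x80) return 1;
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray continuation or bad lead byte */

    if (len > avail) return 0;      /* sequence cut off by end of input */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min) return 0;                       /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* UTF-16 surrogate */
    if (cp > 0x10FFFF) return 0;                  /* beyond Unicode range */
    return len;
}

/* Hypothetical helper: overwrite every byte that is not part of a valid
 * sequence with '?'. Returns the number of bytes replaced. */
size_t utf8_make_valid_inplace(char *str, size_t len)
{
    unsigned char *s = (unsigned char *)str;
    size_t i = 0, replaced = 0;

    while (i < len) {
        size_t n = utf8_valid_len(s + i, len - i);
        if (n == 0) {
            s[i] = '?';
            replaced++;
            i++;
        } else {
            i += n;
        }
    }
    return replaced;
}
```

Dropping the invalid bytes or writing hex escapes would follow the same loop; only the branch for n == 0 changes, and the escape variant needs a separate output buffer because the result can grow.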

> 3) If validated utf8 strings are fundamentally different from 
> unvalidated strings, shouldn't they use a different C type?

I don't think this type of thing usually makes sense.
strlen() takes a char *. It can be used on validated UTF-8,
or on a random sequence of bytes.

> 4) What are the developers' reaction to camel_utf8_getc() 
> on this page: http://www.go-evolution.org/Camel.Misc

Apparently they were useful to the Camel authors. However,
from timings I did, both of these substitutions:

g_utf8_get_char() => g_utf8_get_char_validated()
g_utf8_next_char() => g_utf8_find_next_char()

are quite noticeable slowdowns, not to mention the other
issues that come up when iterating through possibly invalid
input (keeping your handling of invalid characters consistent,
keeping track of input/output indexes, etc.).

Generally, validating at the boundaries is a better approach.
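
To make the trade-off concrete, here is a small plain C sketch of what interior code can look like once input has been validated at the boundary. The names utf8_next_unchecked() and count_non_ascii() are hypothetical and this is not how GLib implements anything. The decoder performs no error checks at all, which is only safe because the caller is assumed to have validated the string once on the way in.

```c
#include <stddef.h>

/* Decode one code point and advance *p past it. Performs no validation:
 * the caller must pass already-validated, NUL-terminated UTF-8. */
static unsigned long utf8_next_unchecked(const char **p)
{
    const unsigned char *s = (const unsigned char *)*p;
    unsigned long cp;
    size_t len, i;

    if (s[0] < 0x80)      { cp = s[0];        len = 1; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; len = 2; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; len = 3; }
    else                  { cp = s[0] & 0x07; len = 4; }

    for (i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3F);

    *p += len;
    return cp;
}

/* Hypothetical interior routine: counts code points above U+007F.
 * It trusts its input, so it has no error path to keep consistent. */
size_t count_non_ascii(const char *validated)
{
    size_t n = 0;

    while (*validated != '\0') {
        if (utf8_next_unchecked(&validated) > 0x7F)
            n++;
    }
    return n;
}
```

The inner loop has no branches for malformed input and no error state to carry around, which is the point of checking once where the data enters the program.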

- Owen


