Re: glib utf8 api

From: Gregory Sharp <gregsharp geo yahoo com>
To: Owen Taylor <otaylor redhat com>
Cc: gtk-devel-list gnome org
Subject: Re: glib utf8 api
Date: Tue, 4 Mar 2008 16:24:52 -0800 (PST)

Thanks so much, Owen and Behdad, for your responses.

> > 1) There seems to be no good way to strncpy a utf8 string
> > into a fixed buffer.  g_strncpy doesn't work, because the
> > last character can get truncated causing an invalid string.
> > g_utf8_strncpy doesn't work either, because I don't know
> > how many characters fit in the buffer.
> 
> Doesn't strike me as a useful operation. Easy enough to write
> yourself with g_utf8_get_char()/next_char()/g_unichar_to_utf8().

May I try to convince you that it is useful?  For good or 
evil, it is still common to copy strings into fixed length
buffers.  That is why functions like strncpy exist in 
the standard C library.  It is not expected that everyone 
write his own strncpy, even though it is easy, because we 
all benefit from using the copy in the library.
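
To make the request concrete, here is a minimal sketch of the kind of
function meant here.  The name utf8_strlcpy and its signature are
hypothetical (this is not an existing GLib function), it uses only the
standard C library, and it assumes the source string is already valid
UTF-8.  It copies as many bytes as fit and, if it has to truncate, backs
up to a character boundary so the result is never cut mid-character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of GLib.
 * Copies src into dst, whose capacity is dst_size bytes including the
 * terminating NUL, without splitting a multi-byte UTF-8 character.
 * Assumes src is valid UTF-8.  Returns strlen(src), as strlcpy does,
 * so the caller can detect truncation (return value >= dst_size). */
size_t utf8_strlcpy(char *dst, const char *src, size_t dst_size)
{
    size_t src_len = strlen(src);
    size_t n;

    if (dst_size == 0)
        return src_len;

    /* Largest byte count that fits while leaving room for the NUL. */
    n = src_len < dst_size ? src_len : dst_size - 1;

    /* When truncating, back up while the first dropped byte is a
     * continuation byte (10xxxxxx), so the cut lands between characters. */
    if (n < src_len) {
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }

    memcpy(dst, src, n);
    dst[n] = '\0';
    return src_len;
}
```

The caller only needs to know the buffer size in bytes, not how many
characters will fit, which is exactly what g_utf8_strncpy cannot offer.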

> > 2) There seems to be no way to create a "best guess" valid
> > string.  g_utf8_validate is nice and all, but if validation 
> > fails I still need to create a valid string.  Am I supposed 
> > to use g_convert_with_fallback() from UTF-8 to UTF-8?
> 
> No, g_convert() needs input in the character set you specify.
> The fallback is for characters not in the output character
> set.
> 
> There are lots of different things you might want to do for
> an "force to valid" function:
> 
>  - Try to guess the real encoding
>  - Drop invalid sequences
>  - Replace invalid sequences with replacement characters or ?
>  - Replace invalid sequences with hex escapes 
>    (The GLib logging functions do this)
> 
> I guess I could see a point for including some function along
> these lines in GLib, though it's not too hard to write your own.

Here I see only the last three options as relevant for
GLib, since they are mechanical transformations.  Guessing
the encoding is probably best left to a specialized
algorithm.
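
As an illustration of the "replace with ?" option, here is a rough
sketch in plain C with no GLib dependency.  Both function names are
hypothetical.  It replaces every byte that does not begin a well-formed
UTF-8 sequence with '?', in place, so the string length never changes.
Overlong forms, surrogates, and values above U+10FFFF are treated as
invalid.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence
 * starting at s, or 0 if the bytes there are invalid or truncated. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    size_t len, i;
    unsigned long cp, min;

    if (avail == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                       /* plain ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return 0;                       /* stray continuation or bad lead byte */

    if (len > avail)
        return 0;                       /* sequence cut off by end of string */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* expected a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong encodings, surrogates, and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

/* Hypothetical helper, not part of GLib: overwrite each invalid byte of
 * the NUL-terminated string with '?'.  Returns the number of bytes
 * replaced; 0 means the input was already valid. */
size_t utf8_replace_invalid(char *str)
{
    unsigned char *p = (unsigned char *)str;
    size_t len = strlen(str), i = 0, replaced = 0;

    while (i < len) {
        size_t n = utf8_seq_len(p + i, len - i);
        if (n == 0) {
            p[i] = '?';
            replaced++;
            i++;
        } else {
            i += n;
        }
    }
    return replaced;
}
```

Dropping invalid sequences or emitting hex escapes would follow the same
loop; only the action taken when utf8_seq_len returns 0 changes, along
with the need for an output buffer of a different size.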

> Generally, validating at the boundaries is a better approach.

I think we agree on this.  Applications need to process
text that comes from dirty sources such as user files,
the network, etc.  So, yes, the boundary validation
belongs in the input processing layer.  Part of that
validation is a correction routine that fixes the
dirty text.

I appreciate your feedback, and I hope we can find these 
(especially g_utf8_strlcpy!) in a new API version.

Greg


Greg Sharp
gregsharp geocities com



