Re: glib utf8 api
- From: Gregory Sharp <gregsharp geo yahoo com>
- To: Owen Taylor <otaylor redhat com>
- Cc: gtk-devel-list gnome org
- Subject: Re: glib utf8 api
- Date: Tue, 4 Mar 2008 16:24:52 -0800 (PST)
Thanks so much, Owen and Behdad, for your responses.
> > 1) There seems to be no good way to strncpy a utf8 string
> > into a fixed buffer. g_strncpy doesn't work, because the
> > last character can get truncated causing an invalid string.
>
> > g_utf8_strncpy doesn't work either, because I don't know
> > how many characters fit in the buffer.
>
> Doesn't strike me as a useful operation. Easy enough to write
> yourself with g_utf8_get_char()/next_char()/g_unichar_to_utf8().
May I try to convince you that it is useful? For good or
evil, it is still common to copy strings into fixed length
buffers. That is why functions like strncpy exist in
the standard C library. It is not expected that everyone
write his own strncpy, even though it is easy, because we
all benefit from using the copy in the library.
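
To make the request concrete, here is a minimal sketch in plain C of
the kind of copy I mean. The name utf8_strlcpy is hypothetical (it is
not an existing GLib function), and the sketch assumes the source is
already valid, NUL-terminated UTF-8. It copies as many whole characters
as fit and returns the source length, strlcpy-style, so the caller can
detect truncation.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of GLib: copy src into dst (dst_size
 * bytes) without cutting a multi-byte UTF-8 character in half.
 * Assumes src is valid, NUL-terminated UTF-8.
 * Returns strlen(src); a result >= dst_size means truncation occurred.
 */
size_t utf8_strlcpy(char *dst, const char *src, size_t dst_size)
{
    size_t src_len = strlen(src);
    size_t n;

    if (dst_size == 0)
        return src_len;

    /* Largest byte count that still leaves room for the terminator. */
    n = src_len < dst_size ? src_len : dst_size - 1;

    /*
     * When truncating, src[n] is the first byte that gets dropped.
     * If it is a continuation byte (10xxxxxx), the cut falls inside a
     * character, so back up until it lands on a character boundary.
     */
    if (n < src_len) {
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }

    memcpy(dst, src, n);
    dst[n] = '\0';
    return src_len;
}
```

With a 3-byte buffer and the source "a" followed by a two-byte
character, this keeps only "a" rather than leaving a dangling lead
byte in the buffer.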
> > 2) There seems to be no way to create a "best guess" valid
> > string. g_utf8_validate is nice and all, but if validation
> > fails I still need to create a valid string. Am I supposed
> > to use g_convert_with_fallback() from UTF-8 to UTF-8?
>
> No, g_convert() needs input in the character set you specify.
> The fallback is for characters not in the output character
> set.
>
> There are lots of different things you might want to do for
> an "force to valid" function:
>
> - Try to guess the real encoding
> - Drop invalid sequences
> - Replace invalid sequences with replacement characters or ?
> - Replace invalid sequences with hex escapes
> (The GLib logging functions do this)
>
> I guess I could see a point for including some function along
> these lines in GLib, though it's not too hard to write your own.
Here I see only the last three options as being relevant
for glib, as they are mechanical translations. Guessing
the encoding is probably best left to a specialized
algorithm.
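
As an illustration of how mechanical the "replace" option is, here is
a rough sketch in plain C. The function names are hypothetical and not
part of GLib. It works in place and substitutes '?' for every byte
that does not belong to a well-formed UTF-8 sequence, so the string
length never changes. Overlong forms, surrogates and code points above
U+10FFFF are treated as invalid.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the byte length of the well-formed UTF-8
 * sequence starting at s, or 0 if the bytes there are not valid.
 * The caller guarantees s[0] != 0; the loop stops at the NUL
 * terminator, so nothing past the end of the string is read.
 */
static size_t utf8_seq_len(const unsigned char *s)
{
    unsigned long cp, min;
    size_t len, i;

    if (s[0] < 0x80) {
        return 1;                               /* plain ASCII */
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; cp = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;       /* stray continuation byte or invalid lead byte */
    }

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;   /* missing continuation byte (or end of string) */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

/* Hypothetical helper: overwrite each invalid byte in s with '?'. */
void utf8_replace_invalid(char *s)
{
    unsigned char *p = (unsigned char *)s;

    while (*p != '\0') {
        size_t len = utf8_seq_len(p);
        if (len == 0)
            *p++ = '?';     /* invalid byte: replace and move on */
        else
            p += len;       /* valid character: skip over it */
    }
}
```

Dropping invalid sequences or writing hex escapes would follow the
same scan, but those variants change the string length and so need a
separate output buffer.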
> Generally, validating at the boundaries is a better approach.
I think we agree on this. Applications need to process
text that comes from dirty sources such as user files,
the network, etc. So, yes, boundary validation belongs
in the input processing layer. Part of that boundary
validation is a correction routine that fixes the
dirty text.
I appreciate your feedback, and I hope we can find these
(especially g_utf8_strlcpy!) in a new API version.
Greg
Greg Sharp
gregsharp geocities com