Re: g_utf8_validate() and NUL characters



coda wrote:
> I ended up here after pursuing the "invalid character coding" behavior of gedit.
>  gedit tries to convert a file to UTF-8 using g_convert, which always succeeds
> when converting from an 8-bit encoding like ISO-8859-1. The converted string
> contents could contain a NUL, since that's the canonical representation of
> U+0000 NULL, a valid character. However, gedit must call g_utf8_validate() on
> the contents to make sure that GTK+ widgets will accept the string, and
> g_utf8_validate() does not consider a NUL character valid. As a result of all
> this, gedit's inability to edit "binary files" is simply an inability to edit a
> file with a NUL byte in it.

Have you tried working around the GTK+ issue with a loop that skips over the
NULs, to see whether there are other issues?
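
Something along these lines is what I have in mind.  This is only a rough
sketch: check_skipping_nuls(), segment_check_fn and is_ascii_segment() are
names I made up for illustration, and the ASCII check is just a stand-in.  In
gedit you would plug in a small wrapper that calls g_utf8_validate() on each
NUL-free segment instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: checks one NUL-free segment. */
typedef bool (*segment_check_fn) (const char *seg, size_t seg_len);

/* Stand-in checker for this sketch only: accepts pure ASCII segments.
 * A real caller would wrap g_utf8_validate() here. */
static bool
is_ascii_segment (const char *seg, size_t seg_len)
{
  for (size_t i = 0; i < seg_len; i++)
    if ((unsigned char) seg[i] >= 0x80)
      return false;
  return true;
}

/* Split buf[0..len) at NUL bytes and run `check` on every non-empty
 * segment, so the NULs themselves are never shown to the checker. */
static bool
check_skipping_nuls (const char *buf, size_t len, segment_check_fn check)
{
  size_t start = 0;

  while (start < len)
    {
      const char *nul = memchr (buf + start, '\0', len - start);
      size_t seg_len = nul ? (size_t) (nul - (buf + start)) : len - start;

      if (seg_len > 0 && !check (buf + start, seg_len))
        return false;

      start += seg_len + 1;   /* step over the segment and its NUL, if any */
    }

  return true;
}
```

If this passes on your test files, the NULs are the only thing the validator
objects to; if it fails, there is something else wrong with the converted data.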

> The bug in gedit is here, with a rather poor patch that lets the file be opened
> but corrupts it if saved: http://bugzilla.gnome.org/show_bug.cgi?id=156199

Note that while your approach of converting from ISO-8859-1 to UTF-8 "works",
entering UTF-8 text into that file and then trying to save does not.  So
it's an either-text-or-binary approach to editing.  A truly useful editing
mode would be to open a file with mixed UTF-8 text and binary data, edit
the UTF-8 text, and save.

> I discussed this on #gtk+ with mathrick and pbor and it seems that the
> assumption that UTF-8 strings are NUL-terminated and contain no NULs runs pretty
> deep. A possible solution is to use "modified UTF-8" (
> http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 ) which represents U+0000 as
> the two-byte sequence 0xC0 0x80, normally illegal in standard UTF-8, but using
> the normal decoding algorithm, represents U+0000.

I believe the NUL bytes are the least of your problems, and can be fixed in
GTK+.
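
For reference, the modified UTF-8 encoding you describe boils down to the
following.  This is a minimal sketch, not anything that exists in glib;
to_modified_utf8() is a made-up name, and it assumes the caller provides an
output buffer of at least 2 * len bytes.

```c
#include <stddef.h>

/* Hypothetical helper: copy in[0..len) to out, writing each 0x00 byte as
 * the two-byte sequence 0xC0 0x80.  The caller must provide room for
 * 2 * len bytes in out.  Returns the number of bytes written. */
static size_t
to_modified_utf8 (const char *in, size_t len, char *out)
{
  size_t o = 0;

  for (size_t i = 0; i < len; i++)
    {
      if (in[i] == '\0')
        {
          out[o++] = (char) 0xC0;   /* overlong lead byte */
          out[o++] = (char) 0x80;   /* continuation byte carrying zero bits */
        }
      else
        out[o++] = in[i];
    }

  return o;
}
```

The result contains no 0x00 bytes, so it survives nul-terminated APIs, but a
strict UTF-8 validator will reject the 0xC0 0x80 pair as an overlong form.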

> I filed a bug on g_utf8_validate() here:
> http://bugzilla.gnome.org/show_bug.cgi?id=555285
> 
> g_utf8_validate() could simply be fixed to accept NUL characters,

Yes, that's what I prefer too.  Many glib functions take a length argument,
but its interpretation varies significantly from one function to another.
Most interpret it as:

  - If -1, str is nul-terminated.  Otherwise length is the *maximum* length in
bytes of str.

The problematic part is the "max" there.  That disallows nul bytes in str even
if a length is provided.  The reasoning for this that I've heard from Owen is
to allow slicing a prefix of a string.  Say, for example, "at most 20 bytes".
However, that approach is inherently incompatible with UTF-8 text.  One can't
simply take 20 bytes at the start of the string and hope that they are
valid UTF-8, since the cut may fall in the middle of a multi-byte sequence.
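
To make the slicing problem concrete, here is a rough sketch of what a caller
would have to do to take "at most N bytes" safely.  utf8_prefix_len() is a
hypothetical helper, not a glib function, and it assumes the input is already
valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: return the largest n <= max_bytes such that
 * str[0..n) ends on a character boundary.  Assumes str[0..len) is
 * already valid UTF-8. */
static size_t
utf8_prefix_len (const char *str, size_t len, size_t max_bytes)
{
  size_t n = max_bytes < len ? max_bytes : len;

  /* If the byte right after the cut is a continuation byte (10xxxxxx),
   * the cut is inside a multi-byte sequence, so back up. */
  while (n > 0 && n < len && ((unsigned char) str[n] & 0xC0) == 0x80)
    n--;

  return n;
}
```

For "a\xC3\xA9" "b" (a, e-acute, b; 4 bytes), asking for at most 2 bytes has
to give back 1, because byte 2 is the second half of the e-acute.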

A saner interpretation would be:

  - If -1, str is nul-terminated.  Otherwise length is the length in bytes of str.

And then there's g_utf8_validate(), which interprets it as:

  - If -1, str is nul-terminated.  Otherwise, length is the length in bytes of
str.  Str must not contain a nul in the first length bytes.

Ugh.  Why is that?  Who knows?  Matthias suggested it is because a string
claiming to be length bytes long but terminating prematurely is not valid.
However, that statement assumes that the string is nul-terminated.
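
For comparison, a validator that takes the length at face value and treats
0x00 as a plain encoding of U+0000 could look roughly like this.  It is a
sketch under my own assumptions, not glib's implementation, and
utf8_validate_len() is a made-up name.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical validator: checks exactly len bytes and accepts embedded
 * 0x00 bytes as U+0000.  Rejects overlong forms, surrogates, truncated
 * sequences and code points above U+10FFFF. */
static bool
utf8_validate_len (const char *str, size_t len)
{
  const unsigned char *p = (const unsigned char *) str;
  size_t i = 0;

  while (i < len)
    {
      unsigned char c = p[i];
      size_t need;             /* continuation bytes expected */
      unsigned long cp, min;   /* code point, smallest legal value */

      if (c < 0x80)            /* ASCII, including 0x00 */
        {
          i++;
          continue;
        }
      else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
      else
        return false;          /* stray continuation or invalid lead byte */

      if (len - i <= need)     /* sequence runs past the end of the buffer */
        return false;

      for (size_t k = 1; k <= need; k++)
        {
          if ((p[i + k] & 0xC0) != 0x80)
            return false;
          cp = (cp << 6) | (p[i + k] & 0x3F);
        }

      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

      i += need + 1;
    }

  return true;
}
```

With this, the three bytes "a\0b" validate when the length is given as 3,
while the modified UTF-8 pair 0xC0 0x80 is still rejected as overlong.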


So yeah, it's all a mess.  I would like to clean it up somehow, but that may
have to wait till glib 3.0...  I don't think the implications of the changes
would be catastrophic, but I can't know without extensively going over all
uses in all projects...  Some Google Code voodoo may help us get a rough
feeling of the odds.

> but functions
> that return a gchar* with no length output parameter, like 
> gtk_text_buffer_get_text(), would require replacements.

Yes.

> Another possibility mentioned was making more use of GString.

Not a huge fan.

> Is there any reason not to support NUL/U+0000 in strings?

None that I know of, and I've been trying to fix this in Pango.


behdad


