Re: g_utf8_validate() and NUL characters



Hey all,

I recently ran across this situation.
The simple fact is that NUL (character 0; not NULL, which is a pointer)
is nowhere stated to be an invalid Unicode character
in the Unicode spec (g_unichar_validate(0) returns TRUE, btw),
and the UTF-8 spec doesn't prohibit 0 either: following its wording literally,
Unicode char 0 transforms to a single byte 0.

Nonetheless, I think g_utf8_validate() should be kept as is,
at least for a long time.  It is misnamed, but it serves such
a useful purpose that it is widely deployed.

I think it should have been named g_utf8_validate_string()
b/c that's a more accurate name.  I think it's fair to
say that strings are NUL-terminated in C: the str* functions
and string literals work that way, and the C standard itself
defines a string as a sequence of characters terminated by
the first null character.

The simple fact is that MOST strings in structures, param-lists etc in C
are simply:
   char *name;
not
   guint name_len;
   char *name;
so you definitely want a function like g_utf8_validate_string()
to ensure that a string doesn't contain NUL in a situation
where it actually cannot be used.
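
To make the difference concrete, here is a minimal sketch in plain C (no GLib).
The struct and function names are made up for illustration; the point is only
that a bare char * loses everything after the first 0 byte, while a counted
buffer does not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record types, for illustration only. */
struct named     { const char *name; };                   /* NUL-terminated */
struct named_len { size_t name_len; const char *name; };  /* counted        */

/* Length as seen by code that only has the pointer: stops at the first 0. */
size_t visible_len(const struct named *n)
{
    assert(n != NULL && n->name != NULL);
    return strlen(n->name);
}

/* Nonzero if the counted buffer holds a 0 byte somewhere in its len bytes,
 * i.e. if it cannot be stored in a plain char * without losing data. */
int has_embedded_nul(const char *data, size_t len)
{
    assert(data != NULL);
    return memchr(data, 0, len) != NULL;
}
```

So a five-byte buffer "ab\0cd" looks like a two-character string to anything
that takes only the pointer, which is the case a string-flavoured validator
has to reject.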

It would be nice if a g_utf8_validate_data (const char *str,
                                            gsize       size,
                                            GError    **error)
could be added...  it should follow the UTF-8 spec, permitting character 0.
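
A rough sketch of what such a function might check, in plain C so it stands
alone.  The names utf8_seq_len() and utf8_validate_data() and the bad_offset
out-parameter are hypothetical stand-ins (the real thing would presumably
report through GError); this is not GLib's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Length (1-4) of the well-formed UTF-8 sequence at p, or 0 if there is none.
 * A single 0 byte counts as a valid one-byte sequence (U+0000). */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned long cp, min;
    size_t need;   /* number of continuation bytes expected */

    if (avail == 0) return 0;
    if (p[0] < 0x80) return 1;
    if ((p[0] & 0xE0) == 0xC0)      { need = 1; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { need = 2; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { need = 3; cp = p[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray continuation byte or 0xF8..0xFF */

    if (avail <= need) return 0;    /* sequence truncated by end of data */
    for (size_t k = 1; k <= need; k++) {
        if ((p[k] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (p[k] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return need + 1;
}

/* Hypothetical stand-in for the proposed function; not GLib API.
 * Returns true if data[0..len) is well-formed UTF-8, 0 bytes included.
 * On failure, *bad_offset (if non-NULL) is set to the first bad byte. */
bool utf8_validate_data(const char *data, size_t len, size_t *bad_offset)
{
    const unsigned char *p = (const unsigned char *)data;
    size_t i = 0;

    assert(data != NULL || len == 0);
    while (i < len) {
        size_t n = utf8_seq_len(p + i, len - i);
        if (n == 0) {
            if (bad_offset) *bad_offset = i;
            return false;
        }
        i += n;
    }
    return true;
}
```

With this, the three bytes "a\0b" validate, while the overlong two-byte
encoding of 0 (0xC0 0x80) does not.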

Perhaps g_utf8_validate_string() could be added (identical to the current
g_utf8_validate(), or maybe without the size param), possibly
deprecating g_utf8_validate() itself as confusing.
But replacing it with the new semantics should probably wait a long time.

-------

This is all rather tangential, I believe, to the
original problem with gedit.  It should do its own UTF-8
validation, b/c a text editor likes to handle invalid
UTF-8 specially.  UTF-8 is a spec that will not change,
and a validator is about 10 lines of code; you can afford to include your own version.
It should first do something smarter
to handle other encodings (detect Latin1, obey the locale, etc.).
And it could default to markup like <red>HEX</red> for non-UTF8 bytes.
That's a lot different from the handling you want from, say,
a configuration parser.
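
For what it's worth, the <red>HEX</red> idea could look roughly like this in
plain C.  All names here are hypothetical and nothing is taken from gedit.
One extra assumption of mine: 0 bytes get marked up as well, purely so the
output stays usable as an ordinary C string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Length (1-4) of the well-formed UTF-8 sequence at p, or 0 if there is none. */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned long cp, min;
    size_t need;   /* number of continuation bytes expected */

    if (avail == 0) return 0;
    if (p[0] < 0x80) return 1;
    if ((p[0] & 0xE0) == 0xC0)      { need = 1; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { need = 2; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { need = 3; cp = p[0] & 0x07; min = 0x10000; }
    else return 0;
    if (avail <= need) return 0;
    for (size_t k = 1; k <= need; k++) {
        if ((p[k] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (p[k] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return need + 1;
}

/* Copies in[0..len) to out, replacing every byte that is not part of a
 * well-formed sequence (and every 0 byte, see above) with <red>XX</red>.
 * out is NUL-terminated whenever out_size > 0.
 * Returns 0 on success, -1 if out is too small. */
int mark_invalid_utf8(const char *in, size_t len, char *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t i = 0, o = 0;

    assert((in != NULL || len == 0) && out != NULL);
    if (out_size == 0) return -1;
    while (i < len) {
        size_t n = utf8_seq_len(p + i, len - i);
        if (n > 0 && p[i] != 0) {
            /* Valid sequence: copy it through, leaving room for the NUL. */
            if (out_size - o <= n) { out[o] = '\0'; return -1; }
            memcpy(out + o, p + i, n);
            o += n;
            i += n;
        } else {
            /* Bad (or 0) byte: emit its hex value wrapped in markup. */
            int w = snprintf(out + o, out_size - o, "<red>%02X</red>",
                             (unsigned)p[i]);
            if (w < 0 || (size_t)w >= out_size - o) { out[o] = '\0'; return -1; }
            o += (size_t)w;
            i++;
        }
    }
    out[o] = '\0';
    return 0;
}
```

A config parser would just bail out at the first bad byte; an editor wants to
keep going and show the user where the damage is, which is what the loop does.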

- dave

