Re: g_utf8_validate() and NUL characters



From: "Behdad Esfahbod", 10/10/2008 10:33

> Freddie Unpenstein wrote:
>> Why not just adopt the old thing of encoding NULLs and other non-UTF-8
>> characters as safe UTF-8 equivalents...?
> Because they are not valid UTF-8? And the moment we give up dealing with
> valid UTF-8 a whole other can of worms opens up.

I am aware of the trouble allowing multiple encodings of a given character can cause. And I'm not suggesting that at all. If you're referring to anything other than that, please expand on that a little.

My assertion here is basically this: ASCII text (defined here as characters 1-127) encodes into UTF-8 as-is. Anything else in the 0-255 set is considered binary, and should be encoded in its shortest multi-byte UTF-8 form. No more, and no less. Call it Glib encoding.
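To make that rule concrete, here is a rough sketch in plain C, not anything that exists in Glib. The function name bytes_to_internal is made up for illustration, and I'm assuming the "0-255 set" means each input byte is treated as the code point of the same value. Bytes 1-127 are copied through; 0 and 128-255 each become a two-byte sequence, so 0 comes out as C0 80.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Glib function): encode raw bytes 0-255 into
 * the proposed internal form.
 *   1-127        -> copied as a single byte
 *   0, 128-255   -> two-byte sequence 110000xx 10xxxxxx
 * 'out' must have room for 2*len + 1 bytes. The result is NUL-terminated
 * and never contains an embedded zero byte.
 * Returns the number of bytes written, excluding the terminator. */
size_t bytes_to_internal(const unsigned char *in, size_t len,
                         unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c >= 1 && c <= 127) {
            out[n++] = c;                                   /* ASCII as-is */
        } else {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte */
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation */
        }
    }
    out[n] = '\0';
    return n;
}
```

The point of the exercise is that the output can be handed to any ordinary NUL-terminated string routine without truncation.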

I believe that differs from the UTF-8 specification ONLY in the handling of the NULL byte, but then I've been avoiding dealing with UTF-8 for the most part for exactly this reason. When UTF-8 is a strict issue, I've been using higher-level scripted languages instead, which already deal with it natively. (And I'm not 100% certain, but I think that's essentially what they all do.)

A "convert to UTF-8" function given a UTF-8 input with a 6-byte representation of the character 'A' would store the regular single-byte representation. Likewise, given a 1 or 4-byte representation of NULL, it would store the 2-byte C080 representation. A generic "convert input to Glib" function, which takes the input data and its encoding and produces "UTF-8 for internal use only" (aka Glib encoding here), would assert that rule even for UTF-8 input. Likewise a "convert Glib to output" function, asked to produce UTF-8 output, would convert whatever it's given into STRICT UTF-8 (ie. restore C080's to their one-byte \0 representation). So the rule of thumb would be, "ALWAYS convert EVERYTHING entering or leaving the application". And that's a Good Thing that should be encouraged regardless of this issue.
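The boundary conversion described above might look roughly like the following in plain C. Both function names are hypothetical, not existing Glib API, and this sketch only handles the NULL case: it assumes the input is otherwise already shortest-form UTF-8 and does not normalise overlong sequences such as the 6-byte 'A'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "convert input to Glib": strict UTF-8 with an explicit
 * length (may contain embedded \0) -> internal form where every \0 is
 * stored as C0 80. 'out' must have room for 2*len + 1 bytes.
 * Returns the internal length, excluding the terminator. */
size_t utf8_to_internal(const unsigned char *in, size_t len,
                        unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0x00) {
            out[n++] = 0xC0;
            out[n++] = 0x80;
        } else {
            out[n++] = in[i];
        }
    }
    out[n] = '\0';
    return n;
}

/* Hypothetical "convert Glib to output": NUL-terminated internal form ->
 * strict UTF-8, restoring each C0 80 pair to a single \0 byte. The output
 * is never longer than the input, and because it may contain embedded
 * zero bytes the caller has to use the returned length. */
size_t internal_to_utf8(const unsigned char *in, unsigned char *out)
{
    size_t n = 0;
    size_t i = 0;

    assert(in != NULL && out != NULL);

    while (in[i] != '\0') {
        /* in[i + 1] is at worst the terminator, so this read is in bounds */
        if (in[i] == 0xC0 && in[i + 1] == 0x80) {
            out[n++] = 0x00;
            i += 2;
        } else {
            out[n++] = in[i++];
        }
    }
    return n;
}
```

Only the code at the edges of the application ever needs to carry a length around; everything in between sees a normal NUL-terminated string.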

I know it's a bit of a mind-bend from where Glib/GTK is right now with encodings, that Glib/GTK developers don't like hearing from us lowly humans, and that there's always resistance to change. But specifications often change when needed to meet practical requirements (no one has ever written a 100% perfect specification). Personally, changing the platform and established behaviour (much harder and more dangerous to attempt) to suit the UTF-8 specification on this rather trivial issue seems far more wrong than breaking the UTF-8 specification slightly for internal use only. (The key being the "for internal use only": all "convert to UTF-8" functions would still produce the strict interpretation with \0's.)

It furthermore seems more correct in this day and age to bend a rule like this in a way that makes things SAFER, by allowing the old NULL-terminated string handling to keep working and not forcing programmers to deal specially with length specifiers, which all too frequently happen to be a great source of coding mistakes.

This would also make it easier to migrate, for example, to UTF-16 at some point in time - everything would already be converting between UTF-8 and Glib-8, so transitioning to Glib-16 would be an entirely internal affair.


Fredderic

