On Mon, Jan 26, 2009 at 12:57:28PM -0500, Owen Taylor wrote:
> On Mon, 2009-01-26 at 18:30 +0100, Martín Vales wrote:
> > Yes, I only talked about the overhead with utf8 outside of glib, only that.
> > Perhaps the only solution is add more support to utf16 in glib with more
> > methods.
>
> There's zero point in talking about a "solution" until you have profile
> data indicating that there is a problem.

Indeed.

UTF-16 is horribly broken by design, and any attempt to migrate
_towards_ it is flawed and should be avoided.

UTF-8 is backward-compatible with the legacy str*() functions in C,
which, like it or not, will be around for a while yet.

 * It never embeds an ASCII NUL ('\0') in the stream unless it means
   it, as U+0000, which is what makes it work with these functions.

 * UTF-8 has nice properties in substring matches - grep can work on
   UTF-8 despite not knowing it, because the bytes of one encoded
   character never appear inside, or straddling, the encodings of
   other characters. A bytewise match is always a real match.

 * This also means that the only occurrence of '\n' in a UTF-8 stream
   is a real one, so cat, head/tail, awk, etc. can properly detect
   where the linefeeds are. 'head' can print, say, the first 3 lines
   of UTF-8 text without knowing it's UTF-8.

 * UTF-8 can be sorted by sorting the encoded bytes alone. sort
   (comparing bytes, as in the C locale) can sort a UTF-8-encoded text
   file: the bytewise order of the encoded strings is the same as the
   code point order of the Unicode strings they encode.

The list goes on.

Meanwhile, on the other end of the spectrum, storing Unicode data as
decoded 32-bit integers makes some sense. Every code point has the same
width, so string indexing takes constant time - the substring between
character indices 4 and 9 in such an array is known to lie between
byte offsets 16 and 36. The presence of combining characters and
double-width glyphs does make this mapping a bit harder, which reduces
the advantage such a scheme has.

Compared to that, UTF-16 offers NONE of these advantages.
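
The UTF-8 and fixed-width properties above are easy to check for
yourself. Here is a rough sketch in Python, used purely for
illustration. The sample strings and the helper names
sort_by_utf8_bytes and utf32_slice are made up for this example and are
not part of glib or any other library. It spot-checks a handful of
strings; it is not a proof.

```python
def sort_by_utf8_bytes(strings):
    """Sort strings by comparing only their UTF-8 encoded bytes."""
    encoded = sorted(s.encode("utf-8") for s in strings)
    return [b.decode("utf-8") for b in encoded]


def utf32_slice(text, start, end):
    """Slice characters [start:end] using byte offsets into UTF-32."""
    buf = text.encode("utf-32-le")  # "-le" variant: no byte order mark
    return buf[4 * start:4 * end].decode("utf-32-le")


# Arbitrary sample data (an assumption of this sketch).
words = ["zebra", "école", "日本語", "\U0001F600 smile", "Ångström",
         "apple", "two\nlines"]

for w in words:
    raw = w.encode("utf-8")
    # NUL and '\n' bytes occur only where the text really had them.
    assert raw.count(b"\x00") == w.count("\x00")
    assert raw.count(b"\n") == w.count("\n")

# Bytewise order of the UTF-8 equals code point order
# (Python compares str values code point by code point).
assert sort_by_utf8_bytes(words) == sorted(words)

# A bytewise substring search agrees with a character-level search.
haystack, needle = "naïve café", "é"
assert (needle.encode("utf-8") in haystack.encode("utf-8")) \
    == (needle in haystack)

# Fixed width: characters 4..9 sit at byte offsets 16..36.
assert utf32_slice("0123456789abc", 4, 9) == "0123456789abc"[4:9]
print("all checks passed")
```
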
UTF-16 cannot be passed through any legacy str*() function, nor will it
work in grep, sed, awk, cut, sort, head, tail, or in fact _any_ of the
standard UNIX text tools. Nor can UTF-16 be array-indexed in constant
time, because of the surrogate pairs used to encode codepoints outside
of the BMP (Basic Multilingual Plane).

In Summary - UTF-16. Don't. Just Don't.

-- 
Paul "LeoNerd" Evans

leonerd leonerd org uk
ICQ# 4135350 | Registered Linux# 179460
http://www.leonerd.org.uk/
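
PS: the UTF-16 problems can be shown the same way. This is again only
an illustrative Python sketch. The helper name utf16_units and the
sample strings are assumptions of the example, and the zero-byte check
only imitates what a NUL-terminated str*() function would see; it does
not call into C.

```python
def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# Plain ASCII text is full of zero bytes once encoded as UTF-16.
raw = "grep".encode("utf-16-le")
assert raw == b"g\x00r\x00e\x00p\x00"
# A NUL-terminated reader would stop after the first byte.
assert raw.split(b"\x00")[0] == b"g"

# Code points outside the BMP take a surrogate pair (two units).
assert utf16_units("A") == 1
assert utf16_units("\U0001F600") == 2

# So unit index != character index: "character 2" by naive offset
# arithmetic is the low surrogate of the pair, not 'b'.
text = "a\U0001F600b"
units = text.encode("utf-16-le")
assert units[4:6] != "b".encode("utf-16-le")
assert units[6:8] == "b".encode("utf-16-le")

# Bytewise order no longer matches code point order: U+FF5E is below
# U+1F600, yet its big-endian UTF-16 bytes sort after the pair's.
low, high = "\uff5e", "\U0001F600"
assert low < high
assert low.encode("utf-16-be") > high.encode("utf-16-be")
print("all checks passed")
```
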