Re: UCS-2 in gunicode.h



-> I think we're just going to add a function like this to glib:
-> 
-> gchar*
-> g_convert (const gchar *str,
->            gint         len,
->            const gchar *to_codeset,
->            const gchar *from_codeset,
->            gint        *bytes_converted)

	So a g_ wrapper around iconv, then.  Will all the g_utf8_*()
functions currently in gunicode.h disappear, then?

	To answer my original question:  libiconv handles every character
internally as a wide character (wchar_t).  Then, when writing out the
converted string, it encodes each character at the width the target
encoding requires (8-bit, 16-bit, or 32-bit).

	For UCS-2, here is the function that converts one internal wchar_t
character into its 16-bit (two-byte, big-endian) output form:

[ From ucs2.h in libiconv: ]

static int
ucs2_wctomb (conv_t conv, unsigned char *r, wchar_t wc, int n)
{
  if (wc < 0x10000 && wc != 0xfffe) {
    if (n >= 2) {
      r[0] = (unsigned char) (wc >> 8);
      r[1] = (unsigned char) wc;
      return 2;    
    } else
      return RET_TOOSMALL;
  } else  
    return RET_ILSEQ;
}


	This looks to me like any Unicode character above U+FFFF--that is,
one that needs more than 16 bits and so does not exist in the UCS-2
space--will result in a "RET_ILSEQ" return value.  (The value 0xFFFE is
rejected the same way.)
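
	To make the check easier to poke at outside of libiconv, here is a
minimal standalone sketch of the same rule.  The names sketch_ucs2_encode,
SKETCH_ILSEQ, and SKETCH_TOOSMALL are my own stand-ins, not libiconv
identifiers, and the numeric values of the error codes are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for libiconv's internal return codes; the real
   values live in libiconv's private headers and may differ. */
enum { SKETCH_ILSEQ = -1, SKETCH_TOOSMALL = -2 };

/* Encode one code point as big-endian UCS-2 into out[0..n).
   Returns the number of bytes written (always 2 on success) or a
   negative SKETCH_* code. */
static int
sketch_ucs2_encode (uint32_t cp, unsigned char *out, size_t n)
{
  if (cp >= 0x10000 || cp == 0xfffe)
    return SKETCH_ILSEQ;        /* needs more than 16 bits, or is 0xFFFE */
  if (n < 2)
    return SKETCH_TOOSMALL;     /* caller must supply a larger buffer */
  out[0] = (unsigned char) (cp >> 8);    /* high byte first */
  out[1] = (unsigned char) (cp & 0xff);  /* then low byte */
  return 2;
}
```

	Calling it with U+20AC should fill the buffer with the bytes 0x20
0xAC, while something like U+1F600 should come back as SKETCH_ILSEQ
without touching the buffer.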

	The function iconv() uses this return value to note that the
conversion of that character has failed.  It will then try several
fallbacks for the character: first, a U+303E-prefixed variant, then
transliteration, and finally it gives up and replaces the character
with "Undefined", Unicode char U+FFFD (the replacement character).

	So, in summary, if you tried to convert a UTF-8 string into a
UCS-2 string, and that UTF-8 string had the multi-byte encoding of a
character above U+FFFF, the conversion would succeed but that
character would be replaced with the UCS-2 encoding of the "Undefined"
character, U+FFFD.  All in all, a very graceful solution if you ask me.
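
	As a rough illustration of that end result, here is a sketch that
converts an array of code points to big-endian UCS-2 and substitutes
U+FFFD for anything that does not fit.  It is not libiconv's code: the
demo_* names are hypothetical, and it skips the intermediate fallbacks
(the U+303E variant and transliteration) and goes straight to the last
resort:

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_REPLACEMENT 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

/* Returns 1 if cp passes the same test as ucs2_wctomb quoted above. */
static int
demo_fits_ucs2 (uint32_t cp)
{
  return cp < 0x10000 && cp != 0xfffe;
}

/* Convert n code points to big-endian UCS-2 in out[0..out_len).
   Code points that do not fit are replaced by U+FFFD.
   Returns the number of bytes written, or (size_t) -1 if out is too
   small.  On success, *substituted (if non-NULL) receives the number
   of characters that were replaced. */
static size_t
demo_to_ucs2 (const uint32_t *in, size_t n,
              unsigned char *out, size_t out_len,
              size_t *substituted)
{
  size_t i, written = 0, subs = 0;

  for (i = 0; i < n; i++) {
    uint32_t cp = in[i];

    if (!demo_fits_ucs2 (cp)) {
      cp = DEMO_REPLACEMENT;    /* last-resort fallback only */
      subs++;
    }
    if (out_len - written < 2)
      return (size_t) -1;       /* not enough room for another unit */
    out[written++] = (unsigned char) (cp >> 8);
    out[written++] = (unsigned char) (cp & 0xff);
  }
  if (substituted)
    *substituted = subs;
  return written;
}
```

	Feeding it { U+0041, U+1F600, U+20AC } with a six-byte buffer
should produce 00 41 FF FD 20 AC and report one substitution, which is
the "succeeds, but with a placeholder" behaviour described above.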

	(It would be cool if Pango could draw a cute little "Don't Panic"
icon for FFFD :) )


--Derek

P.S.> I found the iconv code somewhat hard to follow, with lots of deeply
nested blocks, multiple gotos, return values that are #defined but then
not used in the error-checking switch statements (i.e. magic numbers
instead), and variable names like "ap", "bp", and "cp".  Not at all
like the Glib code.




