Re: G_UTF8String: Boxed Type Proposal



On 17/03/16 20:29, Matthias Clasen wrote:
> Terminology can certainly be confusing at times, but I think that a
> Unicode character is a perfectly well-defined entity, notwithstanding
> the fact that it can be represented in various encodings (a UTF-8
> sequence, a UCS-4 word, a UTF-16 surrogate pair, etc).

You mean a code point, then (that's basically what gunichar is). I think
the reason Unicode people are so pedantic about "code point" is because
a code point may or may not be what you actually mean when you say
"character", whereas it's rare that I see "code point" used with a
meaning other than its Unicode one.

More precisely, a Unicode code point is an abstract entity indexed by a
number, such as U+0041 LATIN CAPITAL LETTER A or U+262D HAMMER AND
SICKLE, which can only be concretely represented as some particular byte
sequence by passing it through an encoding like UCS-4, UTF-8 or
ISO-8859-1. Some encodings are more obvious than others, and in
particular non-Unicode encodings like ISO-8859-1 cannot represent every
Unicode code point.
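
As a rough illustration of the paragraph above, here is a minimal Python
sketch that pushes the two code points mentioned through three encodings.
The helper name encode_or_none is made up for this example, and Python's
"utf-32-be" codec is assumed to be a reasonable stand-in for big-endian
UCS-4.

```python
# U+0041 and U+262D are the two code points named in the message above.
code_points = [0x0041, 0x262D]


def encode_or_none(cp, encoding):
    """Return the bytes for code point `cp` in `encoding`,
    or None if that encoding cannot represent it.

    Hypothetical helper for illustration only.
    """
    try:
        return chr(cp).encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    for cp in code_points:
        # "utf-32-be" is used here as a stand-in for UCS-4 (an assumption).
        for enc in ("utf-32-be", "utf-8", "iso-8859-1"):
            data = encode_or_none(cp, enc)
            shown = data.hex() if data is not None else "not representable"
            print(f"U+{cp:04X} in {enc}: {shown}")
```

The same abstract code point comes out as a different byte sequence under
each encoding, and the ISO-8859-1 attempt for U+262D yields None because
that encoding has no slot for it.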

-- 
Simon McVittie
Collabora Ltd. <http://www.collabora.com/>


