Re: G_UTF8String: Boxed Type Proposal

From: "Jasper St. Pierre" <jstpierre mecheye net>
To: Matthias Clasen <matthias clasen gmail com>
Cc: gtk-devel-list <gtk-devel-list gnome org>
Subject: Re: G_UTF8String: Boxed Type Proposal
Date: Thu, 17 Mar 2016 13:09:19 -0700

The major issue is that "Unicode character" doesn't have a good
definition. The most likely definition is a "Unicode code point".
However, Windows uses "Unicode character" to mean a 16-bit UTF-16
code unit, which means that any code point above the Basic
Multilingual Plane is really composed of two "Unicode characters":
a surrogate pair.
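
To make the surrogate-pair arithmetic concrete, here is a minimal
JavaScript sketch. The helper name toSurrogatePair is made up for
illustration and is not part of any library; it only assumes the
standard UTF-16 encoding rule for code points above U+FFFF.

```javascript
// Hypothetical helper: split a code point above the BMP into its
// two UTF-16 code units (the surrogate pair).
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary-plane code point");
  }
  const offset = codePoint - 0x10000;     // 20 significant bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F4A9);      // U+1F4A9 PILE OF POO
console.log(high.toString(16), low.toString(16));  // d83d dca9
```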

This confusion also extends to JavaScript, which composes its String
type of "characters" which are actually UTF-16 code units. You can see
this with astral plane characters like emoji:

"💩".length
2

"💩" == "\uD83D\uDCA9"
true
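
A small follow-up sketch of the same point, using only standard String
methods: .length counts UTF-16 code units, while the string iterator
(used by the spread operator) walks code points. The variable names
are just for illustration.

```javascript
const poo = "\u{1F4A9}";              // same string as "💩"

const codeUnits = poo.length;         // counts UTF-16 code units
const codePoints = [...poo].length;   // the iterator yields code points

console.log(codeUnits);               // 2
console.log(codePoints);              // 1
console.log(poo.charCodeAt(0).toString(16));   // d83d (high surrogate only)
console.log(poo.codePointAt(0).toString(16));  // 1f4a9 (the full code point)
```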

As an example of a grapheme cluster without a precomposed,
single-code-point form, look at the Regional Indicators, which were
the politics-free way to add flag symbols to Unicode. There are 26
code points, "A" through "Z", and when they are put next to each
other to spell a country code, like "🇺🇸", it's expected that certain
combinations will show up as flags, without Unicode explicitly
defining which ones. But a sequence of regional indicator code points
is entirely one grapheme cluster.
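
As a rough sketch of how those 26 code points line up with "A" through
"Z": the helper toRegionalIndicators below is hypothetical, and the
only fact it relies on is that REGIONAL INDICATOR SYMBOL LETTER A is
U+1F1E6 and the rest follow in alphabetical order. Whether the result
is drawn as a flag is up to the font.

```javascript
const REGIONAL_INDICATOR_A = 0x1F1E6;  // REGIONAL INDICATOR SYMBOL LETTER A

// Hypothetical helper: map the letters "A".."Z" onto the regional indicators.
function toRegionalIndicators(letters) {
  return [...letters.toUpperCase()]
    .map((ch) => {
      const index = ch.charCodeAt(0) - 0x41;  // 0x41 is "A"
      if (index < 0 || index > 25) throw new RangeError("expected A-Z");
      return String.fromCodePoint(REGIONAL_INDICATOR_A + index);
    })
    .join("");
}

const us = toRegionalIndicators("US");
console.log(us === "\u{1F1FA}\u{1F1F8}");      // true
console.log(us.length);                        // 4 UTF-16 code units
console.log([...us].length);                   // 2 code points
console.log([...us.normalize("NFC")].length);  // still 2: no precomposed form
```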

Go drops the terms "character" and "code point" entirely and opts for
"rune" instead, which is just a 32-bit integer holding a code point.

Swift has an even crazier "Character" type [0], which can hold an
entire Grapheme Cluster, rather than just a single code-point. This
actually means that Swift's "Character" type is of potentially
infinite length, since Regional Indicators aren't capped at a maximum
of two code points.
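
For comparison, here is a sketch of counting at all three levels in
JavaScript. It assumes a runtime that ships the standard
Intl.Segmenter API; countGraphemes is a made-up helper name. How
longer runs of regional indicators get segmented depends on the
Unicode version the runtime implements, so the samples stick to a
single pair.

```javascript
// Count grapheme clusters with the standard Intl.Segmenter API.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// An astral code point, a base letter plus combining tilde (no
// precomposed form exists), and a regional indicator pair.
const samples = ["\u{1F4A9}", "q\u0303", "\u{1F1FA}\u{1F1F8}"];

for (const s of samples) {
  // Prints: UTF-16 code units, code points, grapheme clusters.
  console.log(s.length, [...s].length, countGraphemes(s));
}
```

Each sample should come out as a single grapheme cluster even though
the code unit and code point counts differ.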

Unicode is fun.

[0] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285

On Thu, Mar 17, 2016 at 12:42 PM, Matthias Clasen
<matthias clasen gmail com> wrote:

On Thu, Mar 17, 2016 at 2:26 PM, Jasper St. Pierre
<jstpierre mecheye net> wrote:

I'll also ask what "character" means in this case, even though I know
glib also has the same confusion. Are you talking about the number of
Unicode code points in the string, or the number of grapheme clusters,
as defined by Unicode TR29 [0]? The number of code points isn't useful
for editing in all cases, even after NFC normalization. Some grapheme
clusters just don't have a single code-point representation.


I don't think there is any confusion in glib about this, really.
There is no mention of graphemes in GLib at all, it's all just
characters. If you want graphemes, you need pango.




-- 
  Jasper

