A: The Unicode Standard does not guarantee that the canonical ordering of a combining character sequence for any particular script is the 'correct' order from a linguistic point of view; the guarantee is that any two canonically equivalent strings will have the same canonical order.
In retrospect, it would have been possible to have assigned combining classes for certain Arabic and Hebrew non-spacing marks (plus characters for a few other scripts) that would have done a better job of making a canonically ordered sequence reflect linguistic order or traditional spelling orders for such sequences. However, retinkerings at this point would conflict with stability guarantees made by the Unicode Standard when normalization was specified, and cannot be done now. [KW]
</end quote>
I was trying to figure out why backspace does not delete the last character (an accent) in the buffer when entering Hebrew text with accents, and I stumbled upon the reason in gtk+/gtk/gtktextbuffer.c:gtk_text_buffer_backspace():
  if (backspace_deletes_character)
    {
      gchar *normalized_text = g_utf8_normalize (cluster_text,
                                                 strlen (cluster_text),
                                                 G_NORMALIZE_NFD);
      glong len = g_utf8_strlen (normalized_text, -1);
      if (len > 1)
        gtk_text_buffer_insert_interactive (buffer,
                                            &start,
                                            normalized_text,
                                            g_utf8_offset_to_pointer (normalized_text, len - 1) - normalized_text,
                                            default_editable);
      g_free (normalized_text);
    }
And there's the crux. Why is the cluster normalized with g_utf8_normalize() before its last character is dropped? If backspace should not simply delete the last character in the buffer, shouldn't its behavior be language dependent, perhaps as part of the pango language module? In any case, for Hebrew the current behavior is not logical, as there are accents that imo tie stronger than others. E.g. when inserting:
U+05D1 Hebrew Letter Bet
U+05BC Hebrew Point Dagesh or Mapiq
U+05B8 Hebrew Point Qamats
the dotting of the Bet (Mapiq) logically ties stronger than the vowel mark Qamats (to such an extent that fonts often provide a special glyph for the Bet/Mapiq combination), but backspace currently erases the Mapiq first. The reason is presumably that NFD sorts combining marks by canonical combining class, and U+05BC (class 21) sorts after U+05B8 (class 18), so the Mapiq ends up last in the normalized cluster. This also breaks, e.g., an OpenType Bet/Mapiq ligature table, as the two characters are no longer adjacent after normalization. Of course one may build more sophisticated OpenType tables, but this seems quite roundabout...
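The effect on the three-character sequence above can be reproduced with a small stand-alone C sketch. It is not the GLib code. It only models the canonical reordering step of NFD, which is all that matters here since none of these three characters decomposes. The helper names (combining_class, canonical_reorder, backspace_after_nfd) are made up for illustration, and the combining-class table covers only the two marks in the example (Qamats = 18, Mapiq = 21).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Canonical combining classes for the two marks in the example.
 * Assumption for brevity: every other code point is treated as a
 * starter (class 0). A real implementation uses the full Unicode table. */
static int combining_class(uint32_t cp)
{
    switch (cp) {
    case 0x05B8: return 18;  /* HEBREW POINT QAMATS */
    case 0x05BC: return 21;  /* HEBREW POINT DAGESH OR MAPIQ */
    default:     return 0;
    }
}

/* Canonical ordering: stable insertion sort of each run of
 * non-starters by ascending combining class. Starters never move. */
static void canonical_reorder(uint32_t *s, size_t len)
{
    for (size_t i = 1; i < len; i++) {
        uint32_t cp = s[i];
        int cc = combining_class(cp);
        size_t j = i;

        if (cc == 0)
            continue;
        while (j > 0 && combining_class(s[j - 1]) > cc) {
            s[j] = s[j - 1];
            j--;
        }
        s[j] = cp;
    }
}

/* Mimics the backspace logic quoted above: reorder the cluster, then
 * keep everything except the last code point. Returns the new length
 * and reports which code point was dropped. */
static size_t backspace_after_nfd(uint32_t *cluster, size_t len,
                                  uint32_t *removed)
{
    assert(len > 0);
    canonical_reorder(cluster, len);
    *removed = cluster[len - 1];
    return len - 1;
}
```

Calling backspace_after_nfd() on the typed order { 0x05D1, 0x05BC, 0x05B8 } leaves { 0x05D1, 0x05B8 } and reports U+05BC as the removed character, i.e. the Mapiq goes first even though the Qamats was typed last.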
Regards,
Dov