OT: On ISO decisions [codes and charsets] (was: Re: locale for Uzbekistan)

From: Danilo Segan <dsegan gmx net>
To: "Andrew W. Nosenko" <awn bcs zp ua>
Cc: gnome-i18n gnome org
Subject: OT: On ISO decisions [codes and charsets] (was: Re: locale for Uzbekistan)
Date: Sat, 27 Sep 2003 01:37:19 +0200

On Friday, 26 September 2003, 22:29:43 CEST, Andrew W. Nosenko wrote:
> 
> Just one more point why I hate ISO.

I think ISO is great in this particular case -- it provides you with a
*meaningful* (that is, distinct) code for things like different
scripts for the same language.

Character sets have (almost) nothing to do with the script. It is
crucial that I can differentiate between the *codes* (yes, they are
*codes*, not "words to be pronounced"). They are technical entities, in
the sense that we want machines to use them programmatically. Sure, it
helps if they remind us of their true meanings, and that purpose is
fulfilled too (you're not going to tell me that "Cyrl" doesn't remind
you of "cyrillic", are you? :-)

> 
> One time he are think that iso-8859-5 should be used... (Question: is
> at least one cyrillyc specking people exists that uses this brain- 
> dameged encoding?)

Sorry to disappoint you, but it is quite widely used for Serbian on
"Unix systems" (don't tell those OpenGroup folks about my
misappropriation of the name ;-) where UTF-8 is not supported ;-)

It's also quite common to come across mail in ISO-8859-5 between
Serbian speakers. Okay, I admit, there's no real standard encoding for
Serbian, which is why UTF-8 is gaining good acceptance here.

Btw, ISO-8859-5 is no more a "brain-damaged" encoding for cyrillic than
Unicode is, since the latter maps it almost one-to-one, from 0xa1--0xff
to 0x401--0x45f. The exceptions are the soft hyphen at 0xad, the numero
sign at 0xf0 and the section sign at 0xfd; 0xa0 itself is the no-break
space.
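
The correspondence is easy to check with a few lines of Python. This is
only a sketch: it assumes the "iso8859_5" codec that ships with Python
follows the published table, and OFFSET and unicode_of are names made
up for the illustration.

```python
# Sketch: compare ISO-8859-5 bytes with their Unicode code points.
# Assumes Python's built-in "iso8859_5" codec; OFFSET is an illustrative name.
OFFSET = 0x360  # 0xa1 + 0x360 == 0x401


def unicode_of(byte):
    """Return the Unicode code point that ISO-8859-5 assigns to one byte."""
    return ord(bytes([byte]).decode("iso8859_5"))


# Bytes in 0xa1..0xff that do NOT sit at the fixed offset.
exceptions = {b: unicode_of(b) for b in range(0xA1, 0x100)
              if unicode_of(b) != b + OFFSET}

for b, cp in sorted(exceptions.items()):
    print(f"0x{b:02x} -> U+{cp:04X}")
```

Every byte that is not listed sits exactly 0x360 below its Unicode code
point, which is all that "one-to-one" means here.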

Also, if you think ISO-8859-5 is brain-damaged, you should take a look
at the 7-bit YUSCII encoding, which puts some letters in place of "[",
"]" and similar characters ;-)

The main shortcoming of ISO-8859-5, as I see it, is that it is not in
the correct collating order for Serbian (which also means that Unicode
is not either), but as UCA[*] proves, it's possible to make Serbian
sort correctly while keeping the collating sequence correct for Russian
and other cyrillic languages.
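
To make the collation point concrete, here is a small Python sketch. It
is not the UCA, just a hand-written sort key that ranks letters by
their position in the Serbian alphabet; SERBIAN, RANK and serbian_key
are illustrative names, and the word list is arbitrary.

```python
# Sketch: code point order versus Serbian alphabet order.
# SERBIAN, RANK and serbian_key are illustrative names; this is not the UCA.
SERBIAN = "абвгдђежзијклљмнњопрстћуфхцчџш"
RANK = {ch: i for i, ch in enumerate(SERBIAN)}


def serbian_key(word):
    """Sort key: position of each letter in the Serbian alphabet.

    Characters outside the alphabet sort after it, by code point.
    """
    return [(RANK.get(ch, len(SERBIAN)), ch) for ch in word.lower()]


words = ["шума", "ђак", "дан", "ера"]
by_code_point = sorted(words)               # plain code point comparison
by_alphabet = sorted(words, key=serbian_key)

print(by_code_point)
print(by_alphabet)
```

Plain code point order puts "ђак" last, because the lowercase Serbian
letters sit after the basic Russian block, while the alphabet-based key
puts it right after "дан".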

Actually, I don't see any advantage in the KOI8-* encodings, especially
since "stripping the high bit and getting readable text" is hardly
needed in most modern software.
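
For anyone who hasn't seen that trick, a short Python sketch follows. It
assumes the standard "koi8_r" and "iso8859_5" codecs; strip_high_bit is
a made-up helper name and the sample word is arbitrary.

```python
# Sketch: what "stripping the high bit" does to KOI8-R and ISO-8859-5 text.
# strip_high_bit is an illustrative helper, not part of any library.
def strip_high_bit(text, encoding):
    """Encode text, clear bit 7 of every byte, and read the result as ASCII."""
    raw = text.encode(encoding)
    return bytes(b & 0x7F for b in raw).decode("ascii")


word = "привет"
koi = strip_high_bit(word, "koi8_r")     # a rough Latin transliteration
iso = strip_high_bit(word, "iso8859_5")  # unrelated ASCII characters

print(koi)
print(repr(iso))
```

With KOI8-R the word comes out as "PRIWET", which is still
recognizable; with ISO-8859-5 the same operation yields punctuation and
unrelated letters. That is the whole advantage, and it only matters on
a channel that is not 8-bit clean.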

> 
> Now these guys think that we should crack own tongue and brains (in
> physioligy sence) by reading his totally unliteracy abbreviations...
> 
> Cyrillyc, not Cyrllyc!!!
>    ^ note this `i'!!!
> Therefore, possible abbreviations are `Cyr' or `Cyril' or `Cyrill'  
> but nat a `Cyrl' anyway!!!

Btw, I'd mention that it is "cyrillic" in English :-)

So, we've got four-letter codes (sorry, I don't know why 4-letter codes
were decided on, but let's accept that as a fact; why one would later
want *all* the codes to be of the *same* length is quite obvious, at
least I hope), and we've got to describe the Cyrillic script with them.
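
One way to see why the fixed length helps: if script codes are always
four letters and country codes always two, a program can tell them
apart by length alone. The Python sketch below assumes hyphen-separated
tags such as "uz-Cyrl-UZ" (a form not used in this thread) and a
hypothetical helper called split_tag.

```python
# Sketch: classify the parts of a tag purely by their length.
# split_tag and the tag format are assumptions made for illustration.
def split_tag(tag):
    """Split e.g. 'uz-Cyrl-UZ' into language, script and region parts."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            result["script"] = part.title()   # four letters: script code
        elif len(part) == 2 and part.isalpha():
            result["region"] = part.upper()   # two letters: country code
    return result


print(split_tag("uz-Cyrl-UZ"))
print(split_tag("sr-Latn"))
```

No lookup table is needed to decide which part is the script; the
length of the code carries that information.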

We have a choice: Cyri or Cyrl. To me, it's clearer that "Cyrl" is
cyrillic (the first one might be pronounced like "syrai", which bears
hardly any resemblance to "cyrillic"), even more so since some vowels
can usually be dropped and the word will still remind us of the
original.

Actually, there are several similar rules in Serbian for constructing
abbreviations from full names. One of them is to take the first few
consonants and build the abbreviation from them. That means that even
"Crl." (I don't know whether "y" counts as a vowel in English; if it
doesn't, it would be "Cyrl" itself ;-) could be legitimate.

Perhaps English has similar rules which allow that usage, and don't
forget that "cyrillic" is an English word.

Btw, I don't see any particular reason for "hating" ISO because of this
-- it's still easier to remember than some cryptic code (e.g. 0xf642
for cyrillic, 0xf4a7 for latin, etc.)


Cheers,
Danilo

[*] Unicode Collation Algorithm, Unicode Technical Report 10, I believe
