Re: strcasecmp/tolower/toupper breakage
- From: Liam Quin <liam holoweb net>
- To: gnome-devel-list gnome org, gnome-hackers gnome org
- Subject: Re: strcasecmp/tolower/toupper breakage
- Date: Fri, 4 May 2001 18:42:32 -0400
On Fri, May 04, 2001 at 10:30:01AM -0400, Havoc Pennington wrote:
> Indeed. pcre and the Python engine are supposedly adding UTF-8
> support, but if you're using the POSIX regexp stuff...
Yes -- with perl 5.6 you can do,
use utf8;
and then Perl regular expressions are in terms of characters, not bytes.
There are character classes for whitespace, lower case, upper, etc, too,
using the Property mechanism \p, e.g. \p{Lu} for uppercase, \p{InTibetan},
etc. See perldoc perlunicode for more details.
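To make the character-versus-byte distinction concrete, here is a rough sketch in Python rather than Perl. It is only an illustration, not the Perl mechanism itself: the helper name uppercase_letters and the sample string are made up for this example, and the standard unicodedata module stands in for \p{Lu}.

```python
import unicodedata

def uppercase_letters(text):
    """Return the characters whose Unicode general category is Lu
    (roughly what \\p{Lu} matches in a Perl regular expression)."""
    return [ch for ch in text if unicodedata.category(ch) == "Lu"]

# Hypothetical sample: E-acute, 'a', space, Greek capital omega, 'b'.
sample = "\u00c9a \u03a9b"
encoded = sample.encode("utf-8")

# Counting characters and counting UTF-8 bytes give different answers,
# because the two non-ASCII letters each need two bytes.
char_count = len(sample)
byte_count = len(encoded)

upper = uppercase_letters(sample)

print(char_count, byte_count, upper)
```

A byte-oriented matcher would see seven units here and could not classify the two non-ASCII letters; a character-oriented one sees five characters, two of them uppercase.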
The perl utf8 support does not interact well with the locale mechanism,
and locale information is only applied to characters in the range 0..255;
this is wrong in many cases, and it's suggested that you don't do both
"use locale" and "use utf8" in the same program.
Long term, I hope the Unix-like world will migrate to 32-bit characters
(despite the overhead), to the idea of multilingual multinational
software which doesn't have a single fixed "locale", and to losing
the idea that a single input character maps to a single output glyph
in all cases. Many programs are at least part of the way there now --
e.g. pango handling ligatures and combining characters.
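A small Python sketch of why "one character, one glyph" breaks down, and of the overhead of fixed 32-bit characters. It is an illustration under my own assumptions (the variable names and sample strings are invented here), and it says nothing about how pango does its work.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points
# that a renderer is expected to draw as a single glyph.
decomposed = "e\u0301"

# NFC normalization folds the pair into the precomposed U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# A non-zero combining class marks U+0301 as a combining character.
accent_class = unicodedata.combining("\u0301")

# With fixed 32-bit characters every code point costs four bytes,
# where UTF-8 uses one to four depending on the character.
two_chars = "a\u00e9"
utf32_size = len(two_chars.encode("utf-32-be"))
utf8_size = len(two_chars.encode("utf-8"))

print(len(decomposed), len(composed), accent_class, utf32_size, utf8_size)
```

The decomposed and composed strings look the same on screen but have different lengths, so code that assumes one input character per output glyph will miscount one of them.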
There's no reason why a document in a word processing tool can't have
paragraphs in three languages, or why a spreadsheet can't intermix
Hebrew, French and American English numerical presentation conventions.
That is a vision.
But we are not there yet :)
Lee
--
Liam Quin - Barefoot in Toronto - liam holoweb net - http://www.holoweb.net/
Ankh: irc.sorcery.net www.valinor.sorcery.net irc.gnome.org www.advogato.org
Author, Open Source XML Database Toolkit, Wiley August 2000
Co-author: The XML Specification Guide, Wiley 1999; Mastering XML, Sybex 2001