Re: strcasecmp/tolower/toupper breakage

On Fri, May 04, 2001 at 10:30:01AM -0400, Havoc Pennington wrote:
> Indeed. pcre and the Python engine are supposedly adding UTF-8
> support, but if you're using the POSIX regexp stuff...
Yes -- with Perl 5.6 you can write
    use utf8;
and then Perl regular expressions operate on characters, not bytes.
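
To make the characters-versus-bytes distinction concrete, here is a rough
sketch. It is in Python rather than Perl, purely for illustration, and the
sample string is my own; it only shows that the same "any single item"
pattern yields different counts depending on whether it runs over decoded
text or over the raw UTF-8 bytes.

```python
import re

# Sample text written with escapes so the source file encoding does not matter:
# "na\u00efve \u03a9" is "naive" with a diaeresis on the i, a space, and a Greek capital omega.
text = "na\u00efve \u03a9"
raw = text.encode("utf-8")  # the two non-ASCII characters take two bytes each

# "." over a str matches one character (code point) at a time.
chars = re.findall(r".", text)

# The same pattern over bytes matches one byte at a time.
octets = re.findall(rb".", raw, re.DOTALL)

print(len(chars), "characters;", len(octets), "bytes")
```

A byte-oriented engine sees the two non-ASCII characters as two items each,
which is why things like "." or a bracketed class misbehave on UTF-8 input.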

There are character classes for whitespace, lower case, upper case, etc., too,
using the property mechanism \p, e.g. \p{Lu} for uppercase, \p{InTibetan},
etc.  See perldoc perlunicode for more details.
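
As a hedged illustration of what a property such as Lu selects: Python's
standard "re" module has no \p escape, so this sketch looks up the same
Unicode general category directly through the unicodedata module. The helper
name chars_in_category and the sample string are made up for the example.

```python
import unicodedata

def chars_in_category(text, category):
    """Return the characters of text whose Unicode general category equals category."""
    return [ch for ch in text if unicodedata.category(ch) == category]

# "\u00c9cole \u03a9mega abc": E-acute and Greek capital omega are the uppercase letters.
sample = "\u00c9cole \u03a9mega abc"

upper = chars_in_category(sample, "Lu")  # what \p{Lu} would match, one character at a time
lower = chars_in_category(sample, "Ll")  # the lowercase counterpart, \p{Ll}

print(upper, len(lower))
```

The point is that "uppercase" here is a property of the character in the
Unicode database, not something derived from the current locale.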

The perl utf8 support does not interact well with the locale mechanism,
and locale information is only applied to characters in the range 0..255;
this is wrong in many cases, and it's suggested that you don't do both
"use locale" and "use utf8" in the same program.

Long term, I hope the Unix-like world will migrate to 32-bit characters
(despite the overhead), to the idea of multilingual multinational
software which doesn't have a single fixed "locale", and to losing
the idea that a single input character maps to a single output glyph
in all cases.  Many programs are at least part of the way there now --
e.g. pango handling ligatures and combining characters.
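
A small sketch of why one character does not always mean one glyph, again in
Python and only at the code point level. This is not how pango does its
shaping; the helper visible_units is a hypothetical name and a crude
approximation that simply skips combining marks.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points, one visible glyph.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# The "fi" ligature U+FB01 is one code point that stands for two letters;
# NFKC compatibility normalization expands it.
ligature = "\ufb01"
expanded = unicodedata.normalize("NFKC", ligature)

def visible_units(text):
    """Count code points that are not combining marks (a rough stand-in for glyphs)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(len(decomposed), len(composed), visible_units(decomposed), expanded)
```

So the mapping goes both ways: several input characters can produce one
glyph, and one character can stand for several letters.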

There's no reason why a document in a word processing tool can't have
paragraphs in three languages, or why a spreadsheet can't intermix Hebrew,
French and American English numerical presentation conventions.

That is a vision.

But we are not there yet :)


Liam Quin - Barefoot in Toronto - liam holoweb net -
Author, Open Source XML Database Toolkit, Wiley August 2000
Co-author: The XML Specification Guide, Wiley 1999; Mastering XML, Sybex 2001
