Re: strcasecmp/tolower/toupper breakage

Alan Cox <alan redhat com> writes:
> > utf8_strcasecmp() is pretty easy to implement using unichar_tolower(),
> > if you don't change its behavior according to locale.
> The traditional utf8 'oh my god' appears to be regexps...

Indeed. pcre and the Python engine are supposedly adding UTF-8
support, but if you're using the POSIX regexp stuff...

There are UTF-8 safe expressions (e.g. "^\n" or something) but there
are also expressions that will pull in half of a UTF-8
character. Oops. Also, most uses of regexp (such as breaking on
whitespace) are broken from an i18n standpoint (you can't just break
lines on whitespace, you have to use the Unicode line break algorithm).
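
To make the "half of a UTF-8 character" problem concrete, here is a
minimal sketch in C. It does not call any regexp library; it only checks
whether a byte offset (say, the end of a match reported by a
byte-oriented matcher) lands inside a multibyte character. The names
is_continuation() and on_char_boundary() are made up for illustration,
and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* A byte of the form 10xxxxxx is a UTF-8 continuation byte, so it can
 * never be the first byte of a character. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: returns 1 if byte offset `off` in the
 * NUL-terminated string `s` falls between two characters (or at either
 * end of the string), 0 if it falls in the middle of one.
 * Assumes `s` is valid UTF-8 and off <= strlen(s). */
static int on_char_boundary(const char *s, size_t off)
{
    return !is_continuation((unsigned char)s[off]);
}
```

For the three-byte string "n\xC3\xA9" ("n" followed by e-acute), a
matcher that treats "." as one byte and matches ".." stops at offset 2,
and on_char_boundary() reports that offset as being inside a character.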

Basically the C library is all stuck in the locale model. The locale
model was a good stopgap solution, but it's fundamentally broken for a
variety of reasons. Ergo the end goal here is to stop using any of the
C library text handling, more or less.
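
As a rough illustration of text handling that stays out of the locale
model, here is a sketch of the locale-independent utf8_strcasecmp()
idea from the quoted text. Everything prefixed sketch_ is hypothetical.
In particular sketch_unichar_tolower() is only a stand-in that covers
ASCII and the Latin-1 capitals; a real version would need the full
Unicode case mapping data. The decoder assumes valid UTF-8 and does no
real validation.

```c
#include <stdint.h>

/* Stand-in lowercase mapping: ASCII A-Z, plus U+00C0..U+00DE except the
 * multiplication sign U+00D7. Everything else is returned unchanged. */
static uint32_t sketch_unichar_tolower(uint32_t c)
{
    if (c >= 'A' && c <= 'Z')
        return c + 0x20;
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return c + 0x20;
    return c;
}

/* Decode one code point at *p and advance *p past it. Stops early if a
 * continuation byte is missing, so it never reads past the final NUL. */
static uint32_t sketch_utf8_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t c = *s++;
    int extra = 0;

    if (c >= 0xF0)      { c &= 0x07; extra = 3; }
    else if (c >= 0xE0) { c &= 0x0F; extra = 2; }
    else if (c >= 0xC0) { c &= 0x1F; extra = 1; }

    while (extra-- > 0 && (*s & 0xC0) == 0x80)
        c = (c << 6) | (uint32_t)(*s++ & 0x3F);

    *p = s;
    return c;
}

/* Compare two UTF-8 strings code point by code point after lowercasing,
 * without consulting the current locale. Returns <0, 0 or >0. */
static int sketch_utf8_strcasecmp(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;;) {
        uint32_t ca = sketch_unichar_tolower(sketch_utf8_next(&pa));
        uint32_t cb = sketch_unichar_tolower(sketch_utf8_next(&pb));

        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;   /* both strings ended together */
    }
}
```

Because nothing here looks at the locale, the result is the same under
any setlocale() setting, which also means locale-specific rules (the
Turkish dotted/dotless i, for instance) are deliberately not applied.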

