Re: spell checking (long, sorry...)
- From: Brian Stafford <brian stafford uklinux net>
- To: balsa-list gnome org
- Subject: Re: spell checking (long, sorry...)
- Date: Fri, 6 Jul 2001 12:18:57 +0100
On Thu, 5 July 20:33 Albrecht Dreß wrote:
> But now for the bad news... for me it does not work for words with "umlauts"
> (German national characters). Looking into src/spell-check.c, line 1200, I
> found the following regexp to isolate words:
>
> const gchar *new_word_regex = "\\<[[:alpha:]']*\\>";
>
> Apparently, my glibc implementation (yes, LANG/LC_ALL are de_DE.ISO-8859-1)
> recognises umlauts neither in the regexp nor in a call to isalpha().
> Not sure if this changed in glibc 2.2. Changing the expression to
> "\\<[[:alpha:]äöüÄÖÜß']*\\>" helps a little, as most words are now recognised.
> The exceptions are those *starting* with an umlaut (like "ähnlich")...
Maybe the RE library is buggy. AFAIK [:alpha:] is supposed to match alphabetic
characters with or without diacritical marks. OTOH [a-z] merely enumerates
the characters between 'a' and 'z'; not quite the same thing.
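For illustration, here is a minimal sketch of that difference using the POSIX
regcomp()/regexec() interface. The helper name re_matches() is made up for this
example and is not Balsa code. Even in the default C locale the two bracket
expressions differ (only [[:alpha:]] accepts upper case); whether [[:alpha:]]
also accepts a byte such as 0xE4 depends on the active locale and on the libc.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` (POSIX extended syntax)
 * matches somewhere in `text`, 0 if it does not, and -1 if the pattern
 * fails to compile. */
static int re_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With no locale set, re_matches("^[[:alpha:]]+$", "Hello") succeeds while
re_matches("^[a-z]+$", "Hello") does not, because 'H' lies outside the a-z range.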
> Another problem might be the "empty word separator expression" (\< and \>).
> During the discussions about the URL regexps it emerged that there are
> probably more people around whose regexp implementation does not support this
> feature.
Just a thought... why not use PCRE in Balsa? It has a POSIX API as well as
its own, so no code changes are necessary. The RE syntax is the same as Perl's,
so you can rely on \b to mark a word boundary. Unfortunately its character
class tables are generated at compile time, so it may not solve the [:alpha:]
thing.
> So I guess we should think about rewriting this part of the code, and
> maybe replace the regexec stuff with something hardcoded. However, if the
> isalpha implementation has not changed in recent glibc versions, then we have
> the problem that we would have to hand-code all national character sets...
> Opinions?
isalpha() and friends are supposed to be affected by the locale, as selected
through the LANG environment variable. The same is supposed to be true for
[:alpha:]. I suspect a hard-coded parser using isalpha() might have the same
problem given the same libc.
Maybe a program *must* call setlocale() for this to happen; I can't remember
offhand.
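As a rough sketch of what such a hard-coded scanner could look like: standard C
starts every program in the "C" locale, and the environment only takes effect
once setlocale() has been called with an empty locale name. The helper names
below (word_length, use_environment_ctype) are invented for this example and
are not taken from Balsa. Whether the bytes of an ISO-8859-1 umlaut are then
accepted still depends on that locale being installed on the system.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: length of the run of word characters at the start
 * of s. A word character is whatever isalpha() accepts in the current
 * LC_CTYPE locale, plus the apostrophe. */
static size_t word_length(const char *s)
{
    size_t n = 0;

    /* Cast to unsigned char: passing a negative char to isalpha() is
     * undefined behaviour. */
    while (s[n] != '\0' && (isalpha((unsigned char) s[n]) || s[n] == '\''))
        n++;
    return n;
}

/* Hypothetical helper: adopt the character classification selected by the
 * environment (LC_ALL, LC_CTYPE, LANG). Returns 1 on success, 0 if the
 * requested locale is not available. */
static int use_environment_ctype(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}
```

In the plain C locale, word_length("don't stop") is 5, and a word that begins
with byte 0xE4 yields 0 because that byte is not alphabetic there. Calling
use_environment_ctype() once at startup is what would let a de_DE.ISO-8859-1
environment change that result.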
I dislike American spelling on my desktop, so I set LANG to en_GB. From
time to time I get irritating and unexpected side effects from this too,
compared to the C locale (e.g. sorting drives me mental). Presumably
the POSIX committee saw fit to punish the world for having the temerity
not to speak American English.
Regards,
Brian Stafford