RE: GLib: wide-character gregex?



> -----Original Message-----
> From: tlillqvist gmail com [mailto:tlillqvist gmail com] On Behalf Of
> Tor Lillqvist
> Sent: Monday, June 09, 2008 10:24 AM
> To: Boyd, Todd M.
> Cc: gtk-list gnome org
> Subject: Re: GLib: wide-character gregex?
> 
> > Is there a regex package in GLib that is capable of searching/matching
> > wide characters?
> 
> No. GLib's string APIs (except for the explicit wide char conversion
> ones) handle just plain char strings, generally assumed to be UTF-8 in
> cases where it matters. But if you know that a file is in wide
> characters (i.e. UTF-16LE on Windows), then you can use
> g_utf16_to_utf8() to convert its contents to UTF-8 once you have read
> it in (or mapped it into memory).
> 
> > for future reference, I would like to try and track down a wchar_t
> > implementation of regex functions. I was hoping GLib already had them,
> > but perhaps I am wrong.
> 
> Wide characters (wchar_t), although per se part of standard C, in
> practice are used mostly in Windows-specific programming. On Unix and
> Linux, especially in free software circles, encoding Unicode as UTF-8
> is the rule, and thus normal string functions and coding conventions
> can be used. (One notable exception is OpenOffice.org, which used
> UTF-16 internally also on Unix. Dunno about Mozilla, for instance.) So
> in software being mainly developed by people using Linux, you seldom
> see wchar_t.
> 
> (Note that the wchar_t type in gcc on Linux is 32 bits, not 16 bits
> like on Windows, so it actually can represent all characters in
> current Unicode. On Windows when you use wchar_t strings you still
> have to take into consideration that some characters will actually
> take a pair of wchar_ts, so in practice the kind of code you end up
> writing doesn't differ significantly from code that handles UTF-8 or
> other variable-length encodings anyway. It is a question of handling
> Unicode characters as 1..4 chars or 1..2 wchar_ts. You can't just
> pretend each wchar_t is a freestanding character, and that wchar_t
> strings can be split at any place with each part being valid.
> Surrogate pairs do exist.)
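
To make the "1..4 chars or 1..2 wchar_ts" point concrete, here is a
minimal sketch in plain C. The function names utf16_decode() and
utf8_encode() are made up for illustration; this is not how GLib
implements g_utf16_to_utf8(), just the same arithmetic written out,
assuming 16-bit code units held in uint16_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from UTF-16 code units.
 * Returns the number of units consumed (1 or 2), or 0 if the input is
 * empty or malformed (lone or truncated surrogate). */
size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *cp)
{
    if (len == 0)
        return 0;
    uint16_t u = s[0];
    if (u >= 0xD800 && u <= 0xDBFF) {
        /* High surrogate: a low surrogate must follow. */
        if (len < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF)
            return 0;
        *cp = 0x10000 + ((((uint32_t)u - 0xD800) << 10)
                         | ((uint32_t)s[1] - 0xDC00));
        return 2;
    }
    if (u >= 0xDC00 && u <= 0xDFFF)
        return 0;               /* low surrogate with no high one */
    *cp = u;
    return 1;
}

/* Hypothetical helper: encode one code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns how many were written. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(cp <= 0x10FFFF);     /* outside this range is not Unicode */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, U+1D11E is the two units 0xD834 0xDD1E in UTF-16 and the
four bytes F0 9D 84 9E in UTF-8. Cutting the UTF-16 string between
those two units leaves a lone surrogate on each side, which is why
utf16_decode() reports 0 when it is handed only the first unit.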

Thank you for your suggestions. As it is now, I've changed my code to
convert to UTF-8 after reading the file, so that its contents can be
regexed properly. I've done away with regexing the file name altogether,
and I am using strstr() to determine if the extension is one that needs
to be opened.
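
One caveat with strstr() for this: it matches anywhere in the name, so
"notes.txt.bak" would count as a ".txt" file. A rough sketch of a
suffix check in plain C follows; has_extension() is a made-up name,
and the ASCII case-insensitive comparison is my assumption about what
is wanted on Win32, where file names are not case-sensitive.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if filename ends with ext (for example
 * ".txt"), comparing ASCII letters case-insensitively, else 0. */
int has_extension(const char *filename, const char *ext)
{
    size_t name_len = strlen(filename);
    size_t ext_len = strlen(ext);

    if (ext_len == 0 || ext_len > name_len)
        return 0;

    /* Compare only the last ext_len bytes of the file name. */
    const char *tail = filename + (name_len - ext_len);
    for (size_t i = 0; i < ext_len; i++) {
        if (tolower((unsigned char)tail[i]) != tolower((unsigned char)ext[i]))
            return 0;
    }
    return 1;
}
```

With this, has_extension("notes.txt", ".txt") is true and
has_extension("notes.txt.bak", ".txt") is false. If a case-sensitive
match is enough, GLib's g_str_has_suffix() already does the same job.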

Thanks for all your help! Hopefully, I can get this bugger compiling
(and running!) in Win32 today. Then, I can use GLib's dynamic structures
to store my data instead of the incredibly inefficient method of
double-directory-traversal I'm using now. ;)


Todd Boyd
Web Programmer




