RE: GLib: wide-character gregex?

From: "Boyd, Todd M." <tmboyd1 ccis edu>
To: "Tor Lillqvist" <tml iki fi>
Cc: gtk-list gnome org
Subject: RE: GLib: wide-character gregex?
Date: Mon, 9 Jun 2008 10:33:16 -0500

> -----Original Message-----
> From: tlillqvist gmail com [mailto:tlillqvist gmail com] On Behalf Of
> Tor Lillqvist
> Sent: Monday, June 09, 2008 10:24 AM
> To: Boyd, Todd M.
> Cc: gtk-list gnome org
> Subject: Re: GLib: wide-character gregex?
> 
> > Is there a regex package in GLib that is capable of searching/matching
> > wide characters?
> 
> No. GLib's string APIs (except for the explicit wide char conversion
> ones) handle just plain char strings, generally assumed to be UTF-8 in
> cases where it matters. But if you know that a file is in wide
> characters (i.e. UTF-16LE on Windows), then you can use
> g_utf16_to_utf8() to convert its contents to UTF-8 once you have read
> it in (or mapped it into memory).
> 
> > for future reference, I would like to try and track down a wchar_t
> > implementation of regex functions. I was hoping GLib already had
> > them, but perhaps I am wrong.
> 
> Wide characters (wchar_t), although per se part of standard C, in
> practice are used mostly in Windows-specific programming. On Unix and
> Linux, especially in free software circles, encoding Unicode as UTF-8
> is the rule, and thus normal string functions and coding conventions
> can be used. (One notable exception is OpenOffice.org, which uses
> UTF-16 internally on Unix as well. Dunno about Mozilla, for instance.) So
> in software being mainly developed by people using Linux, you seldom
> see wchar_t.
> 
> (Note that the wchar_t type in gcc on Linux is 32 bits, not 16 bits
> like on Windows, so it actually can represent all characters in
> current Unicode. On Windows when you use wchar_t strings you still
> have to take into consideration that some characters will actually
> take a pair of wchar_ts, so in practice the kind of code you end up
> writing doesn't differ significantly from code that handles UTF-8 or
> other variable-length encodings anyway. It is a question of handling
> Unicode characters as 1..4 chars or 1..2 wchar_ts. You can't just
> pretend each wchar_t is a freestanding character, and that wchar_t
> strings can be split at any place with each part being valid.
> Surrogate pairs do exist.)
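
To make the "1..4 chars or 1..2 wchar_ts" point concrete, here is a minimal
sketch in plain C of what a UTF-16 to UTF-8 conversion has to handle. It is
not GLib's implementation; in real code g_utf16_to_utf8() does this job. The
function names (utf16_decode, utf8_encode, utf16_to_utf8) are hypothetical,
and replacing an unpaired surrogate with U+FFFD is an assumption made for
this example only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one code point from a UTF-16 buffer holding at least one unit.
 * Stores the code point in *cp and returns the number of 16-bit units
 * consumed (1 or 2). An unpaired surrogate becomes U+FFFD (assumption).
 */
static size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *cp)
{
    uint16_t u = s[0];

    /* High surrogate followed by a low surrogate: one character, two units. */
    if (u >= 0xD800 && u <= 0xDBFF && len > 1 &&
        s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        *cp = 0x10000 + (((uint32_t)(u - 0xD800) << 10) |
                         (uint32_t)(s[1] - 0xDC00));
        return 2;
    }

    *cp = (u >= 0xD800 && u <= 0xDFFF) ? 0xFFFD : u;
    return 1;
}

/* Encode cp as UTF-8 into out; returns the number of bytes written (1..4). */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/*
 * Convert len UTF-16 units to NUL-terminated UTF-8.
 * out must have room for 3 * len + 1 bytes.
 * Returns the number of bytes written, not counting the NUL.
 */
static size_t utf16_to_utf8(const uint16_t *in, size_t len, unsigned char *out)
{
    size_t i = 0, n = 0;

    while (i < len) {
        uint32_t cp;
        i += utf16_decode(in + i, len - i, &cp);
        n += utf8_encode(cp, out + n);
    }
    out[n] = '\0';
    return n;
}
```

Note that the loop advances by whatever utf16_decode() reports, never by a
fixed one unit, which is exactly why a wchar_t string on Windows cannot be
split at an arbitrary index.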

Thank you for your suggestions. As it is now, I've changed my code to
convert to UTF-8 after reading the file, so that its contents can be
regexed properly. I've done away with regexing the file name altogether,
and I am using strstr() to determine if the extension is one that needs
to be opened.
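
For anyone doing the same kind of extension check, a rough sketch follows.
It compares the end of the name rather than using strstr(), since strstr()
matches anywhere in the string (a name like "notes.txt.bak" contains
".txt"). The helper name has_extension is hypothetical, and the comparison
is case-sensitive, which may or may not be what you want on Win32.

```c
#include <string.h>

/*
 * Return 1 if filename ends with ext (dot included, e.g. ".txt"),
 * 0 otherwise. Case-sensitive. A name that consists of nothing but
 * the extension itself is treated as having no extension.
 */
static int has_extension(const char *filename, const char *ext)
{
    size_t name_len = strlen(filename);
    size_t ext_len = strlen(ext);

    if (ext_len == 0 || name_len <= ext_len)
        return 0;

    /* Compare only the tail of the name against the extension. */
    return strcmp(filename + (name_len - ext_len), ext) == 0;
}
```

With this, has_extension("notes.txt", ".txt") is true, while
has_extension("notes.txt.bak", ".txt") is false.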

Thanks for all your help! Hopefully, I can get this bugger compiling
(and running!) in Win32 today. Then, I can use GLib's dynamic structures
to store my data instead of the incredibly inefficient method of
double-directory-traversal I'm using now. ;)


Todd Boyd
Web Programmer

