Re: Performance implications of GRegex structure
- From: Owen Taylor <otaylor redhat com>
- To: Yevgen Muntyan <muntyan tamu edu>
- Cc: GTK+ development mailing list <gtk-devel-list gnome org>
- Subject: Re: Performance implications of GRegex structure
- Date: Fri, 16 Mar 2007 08:28:26 -0400
On Thu, 2007-03-15 at 14:16 -0500, Yevgen Muntyan wrote:
> [Owen, I apologize, I hit Reply instead of Reply All]
>
> Owen Taylor wrote:
> > So, the regular expression code has been committed to CVS finally. Yay!
> >
> > But looking over the header file, there is something that puzzles me
> > about the way that it's set up: there is no distinction between a
> > "pattern/regular expression" object and a match/matcher object.
> ...
> > Or to Javascript, Perl, etc. (Javascript and Perl hide the issue a bit
> > by having regular expression literals.) While I have never actually done
> > timings on the matter, I've always assumed that the reason that regular
> > expression API's are set up this way is compiling a regular expression
> > has a significant expense.
>
> EggRegex contains two structures internally - Pattern and Match.
> egg_regex_copy() references Pattern and creates new Match structure.
> If you got a copy of EggRegex object using egg_regex_copy(), you
> can use it in another thread. Don't know how convenient it is in
> multi-thread setup. The idea was that it's inconvenient to create
> some match objects on heap.
If people will forgive a code-verbose mail, I want to show some examples
of how things look in practice.
So, start off with a simple function written with the current API:
char *
get_leading_digits(const char *str)
{
GRegex *regex = g_regex_new("^\\d+", 0, 0, NULL);
char *result = NULL;
if (g_regex_match(str, 0))
result = g_regex_fetch(regex, 0);
g_regex_free(regex);
return result;
}
Short, simple, thread-safe, but inherently slow. We could write
the same thing with an API with an explicit match object:
char *
get_leading_digits(const char *str)
{
GRegex *regex = g_regex_new("^\\d+", 0, 0, NULL);
GMatcher *matcher = g_matcher_new(regex, str, 0);
char *result = NULL;
if (g_matcher_matches(matcher))
result = g_matcher_get(regex, 0);
g_matcher_unref(matcher);
g_regex_free(regex);
return result;
}
It gets two lines longer and slightly more cumbersome, but it's not
tremendously worse. But really, repeatedly compiling a regular
expression is a bad idea - something we don't want to encourage.So, the
simplest optimization of the original code looks something like:
char *
get_leading_digits(const char *str)
{
static GRegex *regex = NULL;
char *result = NULL;
if (!regex)
regex = g_regex_new("^\\d+", 0, 0, NULL);
if (g_regex_match(str, 0))
result = g_regex_fetch(regex, 0);
return result;
}
That code bothers me a fair bit; not because so much because it's
not thread safe, but because it exhibits a pattern that is *inherently*
not thread safe or re-entrant. People should have built-in reflexes
against writing such code. I'm much happier with the alternative:
char *
get_leading_digits(const char *str)
{
static GRegex *regex = NULL;
GMatcher *matcher;
char *result = NULL;
if (!regex)
regex = g_regex_new("^\\d+", 0, NULL);
matcher = g_matcher_new(regex, str, 0);
if (g_matcher_matches(matcher))
result = g_matcher_get(regex, 0);
g_matcher_free(matcher);
return result;
}
Even though it's longer and no more thread safe. It's not thread
safe, but can be made so without changing the basic pattern:
gpointer
get_leading_digits_regex(gpointer data)
{
return g_regex_new("^\\d+", 0, NULL);
}
char *
get_leading_digits(const char *str)
{
static GOnce once = G_ONCE_INIT;
GRegex *regex = g_once(&once, get_leading_digits_regex, NULL);
GMatcher *matcher;
char *result = NULL;
matcher = g_matcher_new(regex, str, 0);
if (g_matcher_matches(matcher))
result = g_matcher_get(regex, 0);
g_matcher_free(matcher);
return result;
}
As Yevgen points out, we can do the same thing with the current GRegex
API, but I don't think the result is very readable. It isn't code that
"says what it does".
gpointer
get_leading_digits_regex(gpointer data)
{
return g_regex_new("^\\d+", 0, NULL);
}
char *
get_leading_digits(const char *str)
{
static GOnce once = G_ONCE_INIT;
GRegex *template = g_once(&once, get_leading_digits_regex, NULL);
GRegex *regex;
char *result = NULL;
regex = g_regex_copy(template);
if (g_regex_match(str, 0))
result = g_regex_fetch(regex, 0);
g_regex_free(regex);
return result;
}
Now, a regular expression that you want compile once and keep in a
static variable is actually the most common use of a regular
expression. So, maybe the right version of the function actually
looks like:
char *
get_leading_digits(const char *str)
{
GStaticRegex regex = G_STATIC_REGEX_INIT("^\\d+", 0);
GMatcher *matcher;
char *result = NULL;
matcher = g_matcher_new_static(®ex, str, 0);
if (g_matcher_matches(matcher))
result = g_matcher_get(regex, 0);
g_matcher_free(matcher);
return result;
}
I'm not going to argue that the current GRegex API is unworkable,
but I think it obscures the nature of the system - first you compile a
regular expression, then you match against it - and that's going to make
it harder for people to write correct, efficient code.
- Owen
[
Date Prev][
Date Next] [
Thread Prev][
Thread Next]
[
Thread Index]
[
Date Index]
[
Author Index]