Re: Performance implications of GRegex structure

From: Owen Taylor <otaylor redhat com>
To: Yevgen Muntyan <muntyan tamu edu>
Cc: Gtk+ Developers <gtk-devel-list gnome org>
Subject: Re: Performance implications of GRegex structure
Date: Sat, 17 Mar 2007 17:37:44 -0400

On Sat, 2007-03-17 at 15:45 -0500, Yevgen Muntyan wrote:
> Owen Taylor wrote:
[...]

> > If we can identify the most common patterns of usage, I think we can
> > add convenience functions that make usage of an immutable pattern object
> > almost as convenient as the current GRegex.
> >
> > You can have functions like:
> >
> >  if (g_regex_matches(regex, str, -1, 0)) 
> >     ...
> >
> >  if (g_regex_get_matches(regex, str, -1, 0,
> >                          0, &whole_match,
> >                          1, &first_substring,
> >                          -1))
> >     ...
> >
> >  if (g_regex_get_named_matches(regex, str, -1, 0,
> >                                "firstName", &first_name,
> >                                "lastName", &last_name,
> >                                NULL))
> >     ...
> >
> > The first two cover 98% of all cases when I've ever used a regular
> > expression ... I either want a boolean match / doesn't match, or I
> > want to match against a pattern and, if it succeeds, do something with
> > several substrings.

>  It won't cover usage of EggRegex in gtksourceview. The second variant 
>  seems to be nice for "usual" uses, while the third is not - if your
>  named pattern didn't match you get NULL, and if the whole regex didn't
>  match, you get NULL too. You really want to match, find out whether
>  the whole thing matched, and then look at subpatterns or whatnot.

I didn't provide API docs for my examples! :-) Anyways, my intent 
was that the third example was just like the second example, but for
the funky (?P<name>...) named subpatterns. In both cases the boolean
return value is whether the match succeeded.
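
To make those semantics concrete, here is a minimal sketch. It is not
GRegex code: it uses POSIX regcomp()/regexec() so that it stands alone,
and the name sketch_get_matches() and the 16-group limit are made up for
illustration. The point is only that the return value reports whether
the whole pattern matched, while a NULL substring after a successful
match means that particular group did not take part in it.

```c
#include <assert.h>
#include <regex.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_GROUPS 16   /* arbitrary limit for this sketch */

/* Hypothetical helper, not GLib API. After `str` it takes pairs of
 * (int group_index, char **out), terminated by a lone -1.
 *
 * Returns true only if the whole pattern matched. On success every
 * `out` receives a malloc'd copy of its group, or NULL if that group
 * did not participate in the match. On failure the `out` pointers are
 * left untouched. The regex must not be compiled with REG_NOSUB. */
static bool
sketch_get_matches (const regex_t *regex, const char *str, ...)
{
  regmatch_t groups[SKETCH_MAX_GROUPS];
  va_list args;
  int index;

  assert (regex != NULL && str != NULL);

  if (regexec (regex, str, SKETCH_MAX_GROUPS, groups, 0) != 0)
    return false;                     /* whole pattern did not match */

  va_start (args, str);
  while ((index = va_arg (args, int)) != -1)
    {
      char **out = va_arg (args, char **);

      *out = NULL;                    /* group absent unless shown otherwise */
      if (index >= 0 && index < SKETCH_MAX_GROUPS
          && groups[index].rm_so != -1)
        {
          size_t len = (size_t) (groups[index].rm_eo - groups[index].rm_so);

          *out = malloc (len + 1);
          if (*out != NULL)
            {
              memcpy (*out, str + groups[index].rm_so, len);
              (*out)[len] = '\0';
            }
        }
    }
  va_end (args);

  return true;
}
```

A caller checks the return value first and only then looks at the
substrings, freeing each non-NULL one with free().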

I haven't looked at the GtkSourceView code, but my assumption is that
there are only a few places in the code where it is creating regular
expressions, since its regular expressions are configured in files.
So, adding a few extra lines of code in those places isn't a big deal,
and it doesn't seem to me like the best example of what we need
on the convenience end of things.

It's probably an excellent example of what is needed for performance.

> > That to me, would relegate the matcher object to cases where the
> > annoyance of an extra object is small compared to the complexity of
> > the operation.
> >   
> > You could also take the above functions and have the same thing for:
> >
> >  - Strings (like the current _simple() convenience functions)
> >  - Something like my GStaticRegex proposal
> >
> > As always, the question about convenience functions is "where do you
> > stop?"...

>  Right here, I guess. Let me stress: it's not about *convenience
>  functions*. It's about convenience in using non-simple GRegex API.

Maybe I don't understand your concern. Obviously GRegex needs
to work well for complex uses, but if I have 50 lines of code
manipulating a single regular expression, then changing that to
52 lines of code isn't a big deal.

>  Perhaps it's just that I already have adapted code to changes in
>  EggRegex, not once, and I naturally don't want to do it once again,
>  because some people are used to some stuff in Java...

You are probably half-joking here, but I'll answer it anyways:
once it's in a stable release of GLib, it's in there forever. This is
our only chance to get the API right.

>  To me here the only good argument in favor of separate Match objects is
>  multi-thread uses. Simply because we already have Match object, just
>  hidden. If the best way to fix GRegex for multi-threading is a
>  separate match object, then it should be a separate match object. The
>  rest is really philosophy - if one thinks separate object in code
>  makes it something different conceptually, then he's wrong (it does
>  make API less convenient to use though).

When you evaluate an API, you have to look at a number of things:

 - Is the API complete? Can it do what is needed?
 - Does the API allow getting common things done in a few lines of code?
 - Is the API easy to figure out?
 - Is the resulting code legible and easy to read?
 - Does the API encourage writing efficient and correct code?

That last element is an important one; you can't ignore the psychology
of the person using your API.

>  A separate Match*er* object, which would actually have functionality of
>  current GRegex, is not a good idea, since it would only add an extra
>  object without any change in functionality, in particular it would not
>  be thread-safe (some_get_matcher() or some_new_matcher() would be
>  simply equivalent to current g_regex_copy()).

As I demonstrated earlier, g_regex_copy() *does* provide a way of using
GRegex in a thread-safe manner, but it's unintuitive and a little
clumsy. I think we can do better than that.
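
As a rough illustration of the split under discussion (not a proposal
for actual names), the sketch below keeps the compiled pattern
read-only and puts all per-match state in a separate object owned by
the caller. SketchPattern, SketchMatch and the functions around them
are hypothetical, and POSIX regex stands in for the real engine.
Whether concurrent regexec() calls on one regex_t are safe depends on
the C library, so treat this as a picture of the ownership model
rather than a thread-safety guarantee.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_GROUPS 8    /* arbitrary limit for this sketch */

/* Hypothetical types, not GLib API. */
typedef struct
{
  regex_t compiled;            /* not modified by this code after init */
} SketchPattern;

typedef struct
{
  const char *subject;         /* borrowed; must outlive the match */
  regmatch_t  groups[SKETCH_MAX_GROUPS];
  bool        matched;
} SketchMatch;

static bool
sketch_pattern_init (SketchPattern *pattern, const char *source)
{
  assert (pattern != NULL && source != NULL);
  return regcomp (&pattern->compiled, source, REG_EXTENDED) == 0;
}

static void
sketch_pattern_clear (SketchPattern *pattern)
{
  regfree (&pattern->compiled);
}

/* All results land in *match; the pattern is only read. */
static bool
sketch_pattern_match (const SketchPattern *pattern,
                      const char          *subject,
                      SketchMatch         *match)
{
  match->subject = subject;
  match->matched = regexec (&pattern->compiled, subject,
                            SKETCH_MAX_GROUPS, match->groups, 0) == 0;
  return match->matched;
}

/* Reports where a group lies inside the subject. Returns false if the
 * match failed or the group did not participate. */
static bool
sketch_match_group (const SketchMatch *match, int index,
                    const char **start, size_t *length)
{
  if (!match->matched || index < 0 || index >= SKETCH_MAX_GROUPS
      || match->groups[index].rm_so == -1)
    return false;

  *start = match->subject + match->groups[index].rm_so;
  *length = (size_t) (match->groups[index].rm_eo
                      - match->groups[index].rm_so);
  return true;
}
```

With this shape, two SketchMatch values obtained from the same pattern
stay valid side by side, because the second match has nowhere to
overwrite the first; no copy of the pattern is needed for that.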

					- Owen
