Re: Performance implications of GRegex structure

From: Yevgen Muntyan <muntyan tamu edu>
To: Owen Taylor <otaylor redhat com>
Cc: Gtk+ Developers <gtk-devel-list gnome org>
Subject: Re: Performance implications of GRegex structure
Date: Sat, 17 Mar 2007 15:45:39 -0500

Owen Taylor wrote:

On Fri, 2007-03-16 at 21:30 +0100, Marco Barisione wrote:

Il giorno gio, 15/03/2007 alle 10.18 -0400, Owen Taylor ha scritto:

But looking over the header file, there is something that puzzles me
about the way that it's set up: there is no distinction between a
"pattern/regular expression" object and a match/matcher object.

The internal code in GRegex was deeply modified but the API is quite
similar to the original one written by Scott Wimer and then modified by
Matthias Clasen, so I kept a single GRegex object but with lots of
doubts.

In the end I decided to keep a single object because I prefer this
approach when using languages without a garbage collector and because
QRegExp (the equivalent object in Qt) is a single object.

This matter was brought out in the mailing list and in bugzilla but only
Havoc Pennington and Yevgen Muntyan expressed their opinion saying that
they prefer a single object.


I apologize for not speaking up on the bugzilla bug. I must admit that
though I saw the discussion, I didn't really pay a lot of attention
until the header file appeared in CVS.

I certainly appreciate the arguments for convenience in C; it's a valid
concern. But I don't think we should let convenience be the overriding
factor over everything else; after all, the user *is* writing in C,
so convenience almost certainly wasn't utmost on their mind ;-)

If he uses GRegex instead of raw pcre, then one could say it *is* about
convenience ;)

If we can identify the most common patterns of usage, I think we can
add convenience functions that make usage of an immutable pattern object
almost as convenient as the current GRegex.

You can have functions like:

if (g_regex_matches(regex, str, -1, 0))...


 if (g_regex_get_matches(regex, str, -1, 0,
                         0, &whole_match,
                         1, &first_substring,
                         -1))
    ...

 if (g_regex_get_named_matches(regex, str, -1, 0,
                               "firstName", &first_name,
                               "lastName", &last_name,
                               NULL))
    ...

The first two cover 98% of all cases when I've ever used a regular
expression ... I either want a boolean match / doesn't match, or I
want to match against a pattern, and if succeeds, do something with
several substrings.

It won't cover usage of EggRegex in gtksourceview. The second variant
seems to be nice for "usual" uses, while the third is not - if your
named pattern didn't match you get NULL and if whole regex didn't match,
you get NULL too. You really want to match, get to know if whole thing
matched, and then look at subpatterns or whatnot.
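
[Editorial sketch] The ambiguity described above (NULL for "whole regex
failed" versus NULL for "this subpattern took no part") can be avoided
when the result keeps the two facts apart. The following is only an
illustration using POSIX <regex.h>, not GRegex or EggRegex; the names
match_result, match_string and group_matched are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 4

/* Hypothetical result type: one flag for the whole match plus the
   offsets of each group (groups[0] is the whole match). */
typedef struct {
    int matched;
    regmatch_t groups[MAX_GROUPS];
} match_result;

/* Run an already compiled pattern against str and record the outcome. */
static match_result match_string(const regex_t *re, const char *str)
{
    match_result r;
    memset(&r, 0, sizeof r);
    r.matched = (regexec(re, str, MAX_GROUPS, r.groups, 0) == 0);
    return r;
}

/* A group took part only if the whole pattern matched and its start
   offset is not -1 (POSIX sets unused groups to -1). */
static int group_matched(const match_result *r, size_t group)
{
    assert(group < MAX_GROUPS);
    return r->matched && r->groups[group].rm_so != -1;
}
```

With a pattern such as "^([a-z]+)(-([0-9]+))?$", the caller first checks
r.matched and only then asks group_matched(&r, 3), so an optional group
that stayed empty is distinguishable from a failed match.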

That to me, would relegate the matcher object to cases where the
annoyance of an extra object is small compared to the complexity of
the operation.

You could also take the above functions and have the same thing for:


 - Strings (like the current _simple() convenience functions)
 - Something like my GStaticRegex proposal

As always, the question about convenience functions is "where do you
stop?"...

Right here, I guess. Let me stress: it's not about *convenience
functions*. It's about convenience in using non-simple GRegex API.
Perhaps it's just that I already have adapted code to changes in
EggRegex, not once, and I naturally don't want to do it once again,
because some people are used to some stuff in Java...

To me here the only good argument in favor of separate Match objects is
multi-thread uses. Simply because we already have Match object, just
hidden. If the best way to fix GRegex for multi-threading is a separate
match object, then it should be a separate match object. The rest is
really philosophy - if one thinks separate object in code makes it
something different conceptually, then he's wrong (it does make API less
convenient to use though).
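
[Editorial sketch] To make the "hidden Match object" idea concrete: the
sketch below keeps all per-search state in a small caller-owned struct
and only reads the compiled pattern, so several searches can share one
pattern without touching each other. It uses POSIX <regex.h> rather than
GRegex, starts no threads itself, and the names match_state,
match_state_init and match_state_next are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical per-caller match state; the pattern is shared and
   never modified through this struct. */
typedef struct {
    const regex_t *pattern;
    const char *cursor;   /* where the next search starts */
    int eflags;           /* 0 first, REG_NOTBOL for later searches */
    int count;            /* matches found so far */
} match_state;

static void match_state_init(match_state *m, const regex_t *pattern,
                             const char *str)
{
    assert(m != NULL && pattern != NULL && str != NULL);
    m->pattern = pattern;
    m->cursor = str;
    m->eflags = 0;
    m->count = 0;
}

/* Find the next match. Returns 1 and advances the cursor, or 0 when
   the string is exhausted or nothing more matches. */
static int match_state_next(match_state *m)
{
    regmatch_t whole;

    if (*m->cursor == '\0')
        return 0;
    if (regexec(m->pattern, m->cursor, 1, &whole, m->eflags) != 0)
        return 0;

    m->cursor += whole.rm_eo;
    /* Step one character past an empty match to avoid looping forever. */
    if (whole.rm_eo == whole.rm_so && *m->cursor != '\0')
        m->cursor++;
    m->eflags = REG_NOTBOL;  /* later searches do not start at line start */
    m->count++;
    return 1;
}
```

Two match_state values initialised from the same regex_t can be advanced
in any interleaving, which is the property a separate match object would
have to provide for multi-threaded use.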

A separate Match*er* object, which would actually have functionality of
current GRegex, is not a good idea, since it would only add an extra
object without any change in functionality, in particular it would not
be thread-safe (some_get_matcher() or some_new_matcher() would be simply
equivalent to current g_regex_copy()).

Best regards,
Yevgen
