Re: Performance implications of GRegex structure



Owen Taylor wrote:
On Fri, 2007-03-16 at 21:30 +0100, Marco Barisione wrote:
On Thu, 2007-03-15 at 10:18 -0400, Owen Taylor wrote:
But looking over the header file, there is something that puzzles me
about the way that it's set up: there is no distinction between a
"pattern/regular expression" object and a match/matcher object.
The internal code in GRegex was deeply modified but the API is quite
similar to the original one written by Scott Wimer and then modified by
Matthias Clasen, so I kept a single GRegex object but with lots of
doubts.

In the end I decided to keep a single object because I prefer this
approach when using languages without a garbage collector and because
QRegExp (the equivalent object in Qt) is a single object.

This matter was brought up on the mailing list and in bugzilla, but only
Havoc Pennington and Yevgen Muntyan expressed an opinion, saying that
they prefer a single object.

I apologize for not speaking up on the bugzilla bug. I must admit that
though I saw the discussion, I didn't really pay a lot of attention
until the header file appeared in CVS.

I certainly appreciate the arguments for convenience in C; it's a valid
concern. But I don't think we should let convenience be the overriding
factor over everything else; after all, the user *is* writing in C,
so convenience almost certainly wasn't utmost on their mind ;-)

If he uses GRegex instead of raw pcre, then one could say it *is* about convenience ;)
If we can identify the most common patterns of usage, I think we can
add convenience functions that make usage of an immutable pattern object
almost as convenient as the current GRegex.

You can have functions like:

if (g_regex_matches(regex, str, -1, 0)) ...

 if (g_regex_get_matches(regex, str, -1, 0,
                         0, &whole_match,
                         1, &first_substring,
                         -1))
    ...

 if (g_regex_get_named_matches(regex, str, -1, 0,
                               "firstName", &first_name,
                               "lastName", &last_name,
                               NULL))
    ...

The first two cover 98% of all cases when I've ever used a regular
expression ... I either want a boolean match / doesn't match, or I
want to match against a pattern, and if it succeeds, do something with
several substrings.
It won't cover the usage of EggRegex in gtksourceview. The second variant
seems nice for "usual" uses, while the third is not: if your named
pattern didn't match you get NULL, and if the whole regex didn't match
you get NULL too. You really want to match, find out whether the whole
thing matched, and then look at subpatterns or whatnot.
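
To make that distinction concrete, here is a minimal sketch using POSIX
regex.h as a stand-in (it is not GRegex or EggRegex, and match_group() and
the enum are hypothetical names made up for this illustration). It reports
"the whole pattern failed" separately from "the pattern matched but this
group took no part", which a lone NULL return cannot do.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 8

/* Hypothetical result type: keeps "no match at all" apart from
 * "matched, but the requested group did not participate". */
enum match_result { NO_MATCH, MATCH_WITHOUT_GROUP, MATCH_WITH_GROUP };

/* Hypothetical helper, not part of any GLib API.  For brevity a pattern
 * that fails to compile is also reported as NO_MATCH. */
static enum match_result
match_group (const char *pattern, const char *str, size_t group)
{
  regex_t re;
  regmatch_t m[MAX_GROUPS];
  enum match_result result = NO_MATCH;

  assert (group < MAX_GROUPS);

  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return NO_MATCH;

  if (regexec (&re, str, MAX_GROUPS, m, 0) == 0)
    {
      /* The whole pattern matched; POSIX sets rm_so to -1 for a
       * subexpression that did not take part in the match. */
      result = (m[group].rm_so != -1) ? MATCH_WITH_GROUP
                                      : MATCH_WITHOUT_GROUP;
    }

  regfree (&re);
  return result;
}
```

With the pattern "^([a-z]+)( [a-z]+)?$", asking for group 2 gives three
different answers for "john smith", "john" and "123", so the caller can
check the overall match first and only then look at subpatterns.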
That to me, would relegate the matcher object to cases where the
annoyance of an extra object is small compared to the complexity of
the operation.
You could also take the above functions and have the same thing for:

 - Strings (like the current _simple() convenience functions)
 - Something like my GStaticRegex proposal

As always, the question about convenience functions is "where do you
stop?"...
Right here, I guess. Let me stress: it's not about *convenience
functions*. It's about convenience in using the non-simple GRegex API.
Perhaps it's just that I have already adapted code to changes in
EggRegex, more than once, and I naturally don't want to do it yet again
because some people are used to some stuff in Java...

To me, the only good argument here in favor of separate Match objects is
multi-threaded use, simply because we already have a Match object; it is
just hidden. If the best way to fix GRegex for multi-threading is a
separate match object, then it should be a separate match object. The
rest is really philosophy: if one thinks that a separate object in code
makes it something different conceptually, then he's wrong (it does make
the API less convenient to use, though).

A separate Match*er* object, which would actually have the functionality
of the current GRegex, is not a good idea, since it would only add an
extra object without any change in functionality; in particular, it would
not be thread-safe (some_get_matcher() or some_new_matcher() would simply
be equivalent to the current g_regex_copy()).
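
For illustration only, here is a rough sketch of what "a separate match
object" means in code, again with POSIX regex.h standing in for the real
engine. The Match struct, pattern_match() and match_group_length() are
hypothetical names, not proposed GRegex API. The point is only that the
compiled pattern is passed as const and every result lands in a
caller-owned object, so two callers never share mutable match state
(assuming the underlying engine itself leaves the compiled pattern
untouched while matching).

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 4

/* Hypothetical match object: everything one match attempt produces
 * lives here, not in the compiled pattern. */
typedef struct {
  const char *subject;              /* borrowed, not copied */
  regmatch_t  groups[MAX_GROUPS];   /* offsets of group 0..MAX_GROUPS-1 */
  int         matched;              /* 1 if the whole pattern matched */
} Match;

/* The pattern is only read; all results go into the caller's Match. */
static int
pattern_match (const regex_t *pattern, const char *subject, Match *out)
{
  assert (pattern != NULL && subject != NULL && out != NULL);

  out->subject = subject;
  out->matched = (regexec (pattern, subject, MAX_GROUPS,
                           out->groups, 0) == 0);
  return out->matched;
}

/* Length of capture group n, or -1 if the match failed or the group
 * did not participate. */
static int
match_group_length (const Match *m, int n)
{
  if (!m->matched || n < 0 || n >= MAX_GROUPS || m->groups[n].rm_so == -1)
    return -1;
  return (int) (m->groups[n].rm_eo - m->groups[n].rm_so);
}
```

One pattern compiled once with regcomp() can then fill any number of Match
values, for example one per thread, and an earlier Match stays valid after
a later call because nothing is stored in the pattern.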

Best regards,
Yevgen



