Re: Performance implications of GRegex structure

From: "Nikolai Weibull" <now bitwi se>
To: "Mathieu Lacage" <Mathieu Lacage sophia inria fr>
Cc: gtk-devel-list gtk org, Owen Taylor <otaylor redhat com>, Morten Welinder <mortenw gnome org>
Subject: Re: Performance implications of GRegex structure
Date: Fri, 16 Mar 2007 09:40:35 +0100

On 3/16/07, Mathieu Lacage <Mathieu Lacage sophia inria fr> wrote:

On Thu, 2007-03-15 at 10:56 -0400, Owen Taylor wrote:

> Well, I could imagine (maybe, barely) that someone could show me numbers
> that showed that with a variety of long and complicated regular
> expressions, compiling them was still 10x as fast as matching them
> against very short strings.
>
> But in general, yes, part of my concern is that there are situations
> where you are going to matching the same regular expression against
> thousands of strings, and in that situation, unless compilation is very,
> very, fast, the need to repeatedly recompile will inevitably produce
> measurable overhead.

If this were to happen, could you not just put together a per-thread
5-entry double hash table to cache the 5 most recently used regex
strings ? It really seems like a no-brainer. Am I missing something ?


Now it's beginning to get out of hand.  What's conceptually so
wrong/difficult with having one object for the regex and one object
for the matches generated when matching this regex against some input?
It's how basically every other library/language does it.  It's easy
to understand and easy to use.

Here follows a way to academic description for why it doesn't make
sense to keep the matches with the regex.

Here's a diagram of how pattern matching is done at a very abstract level:

      Pattern
         |
         v
Pattern Matcher Generator
         |
         v
Input->Matcher->Yes/No

The Pattern is the regular expression string, for example, "a*b".  The
Pattern Matcher Generator is what turns the Pattern into the finite
automaton Matcher that recognizes the (not always) regular language
that is described by the regular expression.  Input is the string that
we want to determine if it is part of the regular language that is
described by the regular expression and accepted by the finite
automaton.  Running a finite control for the finite automaton and the
input will yield a Yes or No answer.

We can substitute that Yes/No answer with a Matches object that tells
us more information than simply if the Input matches the Pattern or
not.

In no way does it make sense to keep the answer in the Matcher.


Marke Mielke wrote:

To answer Owen - I expect this is because the base regcomp()/regexec()
libraries to not make this distinction.


Sure they do.  The function regcomp() returns an opaque type regex_t,
while regexec() returns an array of transparent type regmatch_t.

On the whole multithreaded issue, I feel that that's beside the point.
And, besides, the PCRE documentation seems clear about this issue:

The compiled form of a regular expression is not altered during match-
ing, so the same compiled pattern can safely be used by several threads
at once.

  (pcreapi(3) Manual Page)

 nikolai

References:
- Performance implications of GRegex structure
  - From: Owen Taylor
- Re: Performance implications of GRegex structure
  - From: Morten Welinder
- Re: Performance implications of GRegex structure
  - From: Owen Taylor
- Re: Performance implications of GRegex structure
  - From: Mathieu Lacage

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]