Re: Performance implications of GRegex structure



On Thu, Mar 15, 2007 at 10:56:57AM -0400, Owen Taylor wrote:
> The compiled form of a regular expression is not altered during matching, 
> so the same compiled pattern can safely be used by several threads at once.
> ...
> Well, I could imagine (maybe, barely) that someone could show me numbers
> that showed that with a variety of long and complicated regular
> expressions, compiling them was still 10x as fast as matching them
> against very short strings.

To answer Owen: I expect this is because the underlying regcomp()/regexec()
libraries do not make this distinction. Emulating the higher-performing
libraries that separate the Pattern from the Matcher would require
jumping through some hoops.

There are two cases I see. One is multithreaded scalability. If this
were important, it could be simulated for these older libraries with a
pool of pre-compiled regular expression objects. For example, "give me
a new matcher object" would pull a compiled regular expression from the
pool or, if none is available, compile a new one; once the match is
complete, the regular expression would be returned to the pool. At some
point the pool would reach a steady state where no new compilation is
required. I expect its size would line up with the number of threads
using it.
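
A minimal sketch of such a pool, assuming POSIX threads and the plain
regcomp()/regexec() interface. All of the names (regex_pool,
pooled_regex, pool_acquire, pool_release and so on) are made up for
illustration; none of them come from GRegex or any existing library.
The compiles counter is only there to make the steady state visible.

```c
#include <assert.h>
#include <pthread.h>
#include <regex.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical pool entry: one compiled regex plus a free-list link. */
typedef struct pooled_regex {
    regex_t regex;
    struct pooled_regex *next;
} pooled_regex;

/* Hypothetical pool: every entry is compiled from the same pattern. */
typedef struct {
    const char *pattern;     /* must outlive the pool */
    int cflags;              /* flags passed to regcomp() */
    pthread_mutex_t lock;    /* protects idle and compiles */
    pooled_regex *idle;      /* entries not currently in use */
    unsigned long compiles;  /* regcomp() attempts so far */
} regex_pool;

void pool_init(regex_pool *pool, const char *pattern, int cflags)
{
    pool->pattern = pattern;
    pool->cflags = cflags;
    pool->idle = NULL;
    pool->compiles = 0;
    pthread_mutex_init(&pool->lock, NULL);
}

/* Take an idle entry, or compile a new one if none is idle.
 * Returns NULL if allocation or compilation fails. */
pooled_regex *pool_acquire(regex_pool *pool)
{
    pooled_regex *m;

    pthread_mutex_lock(&pool->lock);
    m = pool->idle;
    if (m != NULL)
        pool->idle = m->next;
    else
        pool->compiles++;
    pthread_mutex_unlock(&pool->lock);

    if (m != NULL)
        return m;

    /* Compile outside the lock so other threads are not held up. */
    m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    if (regcomp(&m->regex, pool->pattern, pool->cflags) != 0) {
        free(m);
        return NULL;
    }
    m->next = NULL;
    return m;
}

/* Hand an entry back so that a later pool_acquire() can reuse it. */
void pool_release(regex_pool *pool, pooled_regex *m)
{
    assert(m != NULL);
    pthread_mutex_lock(&pool->lock);
    m->next = pool->idle;
    pool->idle = m;
    pthread_mutex_unlock(&pool->lock);
}

/* Free all idle entries; the caller must have released everything. */
void pool_destroy(regex_pool *pool)
{
    while (pool->idle != NULL) {
        pooled_regex *m = pool->idle;
        pool->idle = m->next;
        regfree(&m->regex);
        free(m);
    }
    pthread_mutex_destroy(&pool->lock);
}
```

Each thread would call pool_acquire(), run regexec() on the entry's
regex member, and then call pool_release(). Since an entry is only
compiled when the idle list is empty, the number of compilations should
top out at the largest number of threads matching at the same moment.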

The second case is the ability to re-use a compiled pattern from the same
thread. I believe this is possible using the provided interface, although
the freedom to use more than one Matcher at the same time might be
convenient.

To illustrate the cost of compile-every-time vs compile-once (19X slower!):

Using the regcomp()/regexec() that comes with my FC6 system, compiling
each time:

-- CUT --
$ cat r.c
#include <sys/types.h>
#include <regex.h>

int main ()
{
    regex_t regex;
    int i;

    /* Compile, match and free the pattern on every iteration. */
    for (i = 0; i < 1000000; i++) {
        regcomp(&regex, "constant", 0);
        regexec(&regex, "text that contains constant somewhere", 0, 0, 0);
        regfree(&regex);
    }

    return 0;
}

$ gcc -O3 -o r r.c

$ time ./r
./r  15.04s user 0.04s system 99% cpu 15.223 total
-- CUT --

Using the regcomp()/regexec() that comes with my FC6 system, compiling
once:

-- CUT --
$ cat r2.c
#include <sys/types.h>
#include <regex.h>

int main ()
{
    regex_t regex;
    int i;

    /* Compile once, outside the loop; only the match is repeated. */
    regcomp(&regex, "constant", 0);
    for (i = 0; i < 1000000; i++) {
        regexec(&regex, "text that contains constant somewhere", 0, 0, 0);
    }
    regfree(&regex);

    return 0;
}
$ gcc -O3 -o r2 r2.c
$ time ./r2
./r2  0.77s user 0.00s system 100% cpu 0.773 total
-- CUT --

-- 
mark mielke cc / markm ncf ca / markm nortel com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/



