Re: Followup: opinions on Search services

From: Behdad Esfahbod <behdad cs toronto edu>
To: Jamie McCracken <jamiemcc blueyonder co uk>
Cc: gnome-devel-list gnome org
Subject: Re: Followup: opinions on Search services
Date: Thu, 7 Apr 2005 14:56:56 -0400 (EDT)

I really hate replying in this thread, but since nobody has shed
light on the abilities of Lucene, I will try to do that.  Hopefully
people will stop talking about what they don't know.

On Wed, 6 Apr 2005, Jamie McCracken wrote:

> 2) Use of an SQL database is a far superior, faster and flexible
> solution to using a dedicated indexer like the Lucene engine (all other
> competing engines like Spotlight use sql databases). This is one area
> search services has got right.

Lucene is a decent search engine.  You cannot compare it with SQL
databases; you can compare it to another search engine, which may
or may not use an SQL database as its backend.  But as soon as you
are talking about search engines, their implementation details
don't matter at all.  So SearchServices, by using SQL databases,
is really losing here, since it has to do a lot to catch up with
Lucene, which I doubt it can.  SQL databases are good things if you
want atomicity, transactions, scalability, and support for (really)
complicated queries: joins, subqueries, etc.  None of these is
needed at all in a desktop search service, where you have one
single server per user that does the indexing too.  What SQL
databases provide for a search engine is at best the "like"
operator, and, well, they can use indexes when you are matching
the beginning of the string.  And all the RDBMS hype comes from
decent products like PostgreSQL, not a toy-sized one like SQLite.
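
To make the point about prefix matching concrete, here is a minimal
sketch in Python.  It is not SQL and not taken from any of the
projects discussed; a sorted list stands in for an ordered index,
and the sample words and function names are made up for
illustration.  A prefix can be located by binary search, while a
substring in the middle of a word forces a look at every entry.

```python
import bisect

# Hypothetical sample data; the sorted list stands in for an ordered index.
words = sorted(["search", "season", "seat", "research", "sea", "lucene"])

def prefix_lookup(index, prefix):
    """Find entries starting with `prefix` using two binary searches."""
    lo = bisect.bisect_left(index, prefix)
    # Assumption: no entry contains a character above U+FFFF right after the prefix.
    hi = bisect.bisect_left(index, prefix + "\uffff")
    return index[lo:hi]

def substring_scan(index, needle):
    """Ordering does not help here: every entry has to be inspected."""
    return [w for w in index if needle in w]

prefix_hits = prefix_lookup(words, "sea")
substring_hits = substring_scan(words, "sea")
```

The prefix lookup never touches "research", although that word
contains "sea"; only the full scan finds it.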

Lucene, on the other hand, comes from an experienced ex-employee
of AltaVista, and from the Apache Foundation.  It's specialized for
search services.  It allows for localization of search technology:
you have an English normalizer, a German one, a Persian one, ....
Yes, you have text normalizers there.
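
As a rough idea of what a text normalizer does, here is a small
Python sketch.  It is not Lucene's code, and the function names and
the particular folding rules are my own assumptions; the only thing
that matters is that the same folding is applied both at index time
and at query time.

```python
import unicodedata

# Illustrative subset: fold Arabic code points to the Persian forms.
# (Assumption: a real normalizer may fold in the other direction.)
PERSIAN_FOLD = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

def normalize_latin(text):
    """Case-fold, then strip combining marks (accents, umlaut dots)."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def normalize_persian(text):
    """Map visually identical Arabic letters to one Persian form."""
    return text.translate(PERSIAN_FOLD)

english = normalize_latin("Café")
german = normalize_latin("Straße")
persian = normalize_persian("\u0643\u062a\u0627\u0628")
```

With this, a query typed with an Arabic keyboard layout matches text
entered with a Persian one, and "Straße" matches "strasse".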

> Cause its not just about indexing - We have metadata too and
> that really needs a DB. If all you want is a google on your
> hard drive then yes a dedicated indexer would be best but an
> RDBMS will give you expandability and flexibility in handling
> structured metadata with more powerful search options.

Very good point.  Yes, Lucene accepts metadata too.  You can have
an unlimited number of fields.  In fact, Lucene is quite like a
relational database: you have different tables, and each table has
a number of fields.  It's just that you are not forced to have a
primary key.  At search time, you can search a table, or any field
of it, with exact or fuzzy matching.  Queries can be built in a
tree-like fashion, using AND, OR, and NOT operations.  And it
already has parsers for Google-like queries.  It even accepts
wildcards in query words.  It also accepts quotation marks for
searching phrases exactly, something that's a nightmare to do
with RDBMS-based systems.
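
To show why fielded, Boolean and phrase queries come naturally to a
search engine, here is a toy positional inverted index in Python.
It is only a sketch of the general technique, not Lucene's API; the
class, its method names and the sample documents are all
hypothetical.

```python
from collections import defaultdict

class TinyIndex:
    """Toy positional index: (field, term) -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_ids = set()

    def add(self, doc_id, fields):
        self.doc_ids.add(doc_id)
        for field, text in fields.items():
            for pos, term in enumerate(text.lower().split()):
                self.postings[(field, term)].setdefault(doc_id, []).append(pos)

    def term(self, field, word):
        """Documents whose `field` contains `word`."""
        return set(self.postings.get((field, word.lower()), {}))

    def phrase(self, field, text):
        """Documents whose `field` contains the words at consecutive positions."""
        words = text.lower().split()
        if not words:
            return set()
        hits = set()
        for doc_id, starts in self.postings.get((field, words[0]), {}).items():
            for start in starts:
                if all(start + i in self.postings.get((field, w), {}).get(doc_id, [])
                       for i, w in enumerate(words)):
                    hits.add(doc_id)
        return hits

    def not_(self, docs):
        return self.doc_ids - docs

# Made-up sample documents with a metadata field and a text field.
idx = TinyIndex()
idx.add(1, {"poet": "hafez", "verse": "the rose has bloomed in the garden"})
idx.add(2, {"poet": "saadi", "verse": "the garden rose in spring"})
idx.add(3, {"poet": "hafez", "verse": "spring wine and the nightingale"})

# AND is set intersection, OR is union, NOT is the complement.
hafez_and_rose = idx.term("poet", "hafez") & idx.term("verse", "rose")
rose_not_hafez = idx.term("verse", "rose") & idx.not_(idx.term("poet", "hafez"))
exact_phrase = idx.phrase("verse", "garden rose")
```

Because positions are kept per term, the phrase query is a cheap
position check rather than a scan of the text.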

I had an experience with Lucene a couple of years ago
(http://rira.ir/).  I was working on a small database of
Persian poetry, some 700'000 verses in 17'000 poems.  I had it
imported into PostgreSQL, in some ten tables.  I wanted to add a
search service.  Using a table for word-item matchings was out of
the question.  I got Lucene, and it was a matter of a couple of
hours to write an indexer to fetch data out of PostgreSQL and
import it into Lucene.  Some of my observations were really
stunning:

  * Data was coming out of PostgreSQL views, which were simply
natural joins of some six tables (poet, book, part, poem, block,
verse), all indexed, etc.  The database was tuned to the best of my
knowledge (shared memory size, vacuumed, etc).  Lucene and the
indexer were running on another machine.  The indexing took just
under one minute, with the PostgreSQL server making its machine
just unusable during this period, perhaps writing join tables to
the hard disk and fetching them back later, etc, while the Lucene
machine was as happy as a machine can be.

  * The raw SQL dump of the data was 45MiB; Bzip2 would reduce it
to 17MiB.  The PostgreSQL database holding this data takes more
than 70MiB, not counting an indexing system on top of that.  In
Lucene, for each field you can select at index time whether you
want the field to be stored in the database (to be returned at
search time) or not.  I could have simply stored primary keys to my
RDBMS database, but decided to store the whole text in Lucene,
since after all, stored AND indexed, the database was a small
30MiB file!! and my search page didn't need to contact the RDBMS
for search excerpts anymore.

  * For a small project like mine, which needs almost none of an
RDBMS's glories --or to be honest it does need them, but the
performance of the joins I like is not satisfying at all--, I may
decide to move completely to Lucene.  It provides all I want, and
at least fetching the number of rows is far cheaper than in
PostgreSQL, for example.  (Don't argue about MySQL and others; they
barely have things like views, schemas, etc.)
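
The stored-versus-indexed choice from the second observation can be
sketched in a few lines of Python.  Again, this is not Lucene's API;
the class, the `index` and `store` flags and the sample records are
assumptions made for illustration.

```python
from collections import defaultdict

class FieldedIndex:
    """Each field is independently searchable (index) and/or returned (store)."""

    def __init__(self):
        self.terms = defaultdict(set)  # (field, term) -> doc ids
        self.stored = {}               # doc id -> {field: original text}

    def add(self, doc_id, field, text, index=True, store=False):
        if index:
            for term in text.lower().split():
                self.terms[(field, term)].add(doc_id)
        if store:
            self.stored.setdefault(doc_id, {})[field] = text

    def search(self, field, word):
        """Return (doc_id, stored fields) pairs for documents matching `word`."""
        ids = sorted(self.terms.get((field, word.lower()), set()))
        return [(doc_id, self.stored.get(doc_id, {})) for doc_id in ids]

# Hypothetical records: 42 keeps its text in the index, 43 only its key.
idx = FieldedIndex()
idx.add(42, "verse_id", "42", index=False, store=True)
idx.add(42, "verse", "hypothetical verse text", index=True, store=True)
idx.add(43, "verse_id", "43", index=False, store=True)
idx.add(43, "verse", "another verse", index=True, store=False)

results = idx.search("verse", "verse")
```

For document 42 the search page has everything it needs for an
excerpt; for document 43 it only gets the key back and would have to
go to the RDBMS for the text.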

Cheers,
--behdad
http://behdad.org/

Follow-Ups:
- Re: Followup: opinions on Search services
  - From: Manuel Amador

References:
- Followup: opinions on Search services
  - From: Manuel Amador
- Re: Followup: opinions on Search services
  - From: Joe Shaw
- Re: Followup: opinions on Search services
  - From: Jamie McCracken
