Re: Followup: opinions on Search services



On Thu, 2005-04-07 at 14:56 -0400, Behdad Esfahbod wrote:
> 
> I really hate replying in this thread, but since nobody has shed
> light on the abilities of Lucene, I'll try to do that.  Hopefully
> people will stop talking about what they don't know.
> 
> On Wed, 6 Apr 2005, Jamie McCracken wrote:
> 
> > 2) Using an SQL database is a far superior, faster and more
> > flexible solution than using a dedicated indexer like the Lucene
> > engine (all the other competing engines, like Spotlight, use SQL
> > databases). This is one area Search services has got right.
> 
> Lucene is a decent search engine.  You cannot compare it with SQL
> databases; you can compare it to other search engines, which may
> or may not use SQL databases as a backend, but as soon as you are
> talking about search engines, their implementation details don't
> matter at all.  So Search services, by using SQL databases, is
> really losing here, since it has a lot of catching up to do with
> Lucene, and I doubt it can.

We use MySQL's MATCH ... AGAINST operator.  We really do not need much
more than that =).

>   SQL databases are good if you want
> atomicity, transactions, scalability, and support for (really)
> complicated queries: joins, subqueries, etc.

Which we use heavily.
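Roughly the kind of thing I mean -- a sketch only, with made-up table
and column names rather than the actual Search services schema
(assumes a MyISAM table with a FULLTEXT index on the body column):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="search", passwd="secret",
                       db="searchservices")
cur = conn.cursor()

# One statement intersects a full-text relevance MATCH with ordinary
# metadata predicates via a join.
cur.execute("""
    SELECT d.uri, MATCH(d.body) AGAINST (%s) AS score
      FROM documents d
      JOIN metadata m ON m.doc_id = d.id
     WHERE MATCH(d.body) AGAINST (%s)
       AND m.author = %s
       AND m.mime_type = 'text/plain'
       AND m.mtime >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
     ORDER BY score DESC
""", ("paloma", "paloma", "rudd-o"))

for uri, score in cur.fetchall():
    print uri, score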

>   None of which is
> needed at all in a desktop search service, where you have a
> single server per user that does the indexing too.

We have only one server for all users.  We thus dramatically reduce the
load on multiuser systems.

>   What SQL
> databases provide for a search engine is at best the LIKE
> operator, and, well, they can only use indexes when you are
> matching the beginning of the string.  And all the RDBMS hype
> comes from decent products like PostgreSQL, not a toy-sized one
> like SQLite.

MySQL has MATCH =).

> 
> Lucene, on the other hand, comes from an experienced ex-employee
> of AltaVista, and from the Apache Foundation.  It's specialized
> for search.  It allows for localization of search technology:
> you have an English normalizer, a German one, a Persian one, and
> so on.  Yes, you have text normalizers there.

That is very good, I grant you that.  But there's no way we could
intersect Lucene results with our own metadata search results, such as:

- documents created by rudd-o in the last month that contain the word
"paloma" and were written in plain text

... unless we moved completely to Lucene, which you appear to have
done.  By the way, does Lucene have Python bindings?
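
If Lucene does have Python bindings, I imagine the intersection could
be sketched somewhat like this -- entirely speculative: the field
names, the index path, and the assumption that the bindings mirror the
Java Lucene classes are all my guesses, not verified:

from PyLucene import (IndexSearcher, StandardAnalyzer, QueryParser,
                      Term, TermQuery, BooleanQuery)

searcher = IndexSearcher("/home/rudd-o/.search-index")  # hypothetical path
analyzer = StandardAnalyzer()

# Full-text clause: documents containing the word "paloma".
text_query = QueryParser.parse("paloma", "contents", analyzer)

# Metadata clauses, assuming "author" and "mimetype" were stored as
# ordinary Lucene fields at index time.
author_query = TermQuery(Term("author", "rudd-o"))
mime_query = TermQuery(Term("mimetype", "text/plain"))

# Intersect the three: each clause required (AND), none prohibited.
# The "last month" restriction could be a range query on a date field,
# omitted here for brevity.
query = BooleanQuery()
query.add(text_query, True, False)
query.add(author_query, True, False)
query.add(mime_query, True, False)

hits = searcher.search(query)
print "%d matching documents" % hits.length()
for i in range(hits.length()):
    print hits.doc(i).get("uri")  # "uri": another hypothetical stored field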

Plus, Search services stores the document information in a separate
object-oriented database, for future use and retrieval.  The other,
unexplored side of Search services is the possibility of client apps
requesting metadata from the Search services server, instead of having
to link against metadata extraction libraries and implement that *in
the apps*.  And it's all an XML-RPC call away.
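
Just to give an idea of how little a client would need, a sketch --
the port and method name here are invented, not the actual Search
services interface:

import xmlrpclib

# Hypothetical endpoint and method; the real interface may differ.
server = xmlrpclib.ServerProxy("http://localhost:8400/")
metadata = server.get_metadata("/home/rudd-o/papers/paloma.txt")
for key, value in metadata.items():
    print "%s: %s" % (key, value)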


> 
> Very good point.  Yes, Lucene accepts metadata too.  You can have
> an unlimited number of fields.  In fact, Lucene is quite like a
> relational database: you have different tables, and each table
> has a number of fields.  It's just that you are not forced to
> have a primary key.  At search time, you can search a table, or
> any field of it, with exact or fuzzy matching.  Queries can be
> built in a tree-like fashion, using AND, OR, and NOT operations.
> And it already has parsers for Google-like queries.  It even
> accepts wildcards in query words.  It also accepts quotation
> marks for searching exact phrases, something that's a nightmare
> to do with RDBMS-based systems.

Oh.  Then I'll start looking at it!
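
Actually, from a quick look at the query parser syntax, the Google-like
queries you mention would seem to boil down to something like this (a
sketch; "contents" as the default field is just my guess):

from PyLucene import QueryParser, StandardAnalyzer

analyzer = StandardAnalyzer()
# +required terms, -excluded terms, an exact "quoted phrase", and a
# trailing wildcard, all in one query string.
query = QueryParser.parse(
    '+paloma -draft "plain text" rudd*', "contents", analyzer)
print query.toString("contents")  # shows the parsed boolean query tree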

> 
> 
> I had an experience with Lucene a couple of years ago
> (http://rira.ir/).  I was working on a small-sized database of
> Persian poetry, some 700,000 verses in 17,000 poems.  I had it
> imported into PostgreSQL, in some ten tables.  I wanted to add a
> search service.  Using a table of word-item matchings was out of
> the question.  I got Lucene, and it was a matter of a couple of
> hours to write an indexer to fetch the data out of PostgreSQL and
> import it into Lucene.  Some of my observations were really
> stunning:
> 
>   * Data was read out of PostgreSQL views, which were simply a
> natural join of some six tables (poet, book, part, poem, block,
> verse), all indexed, etc.  The database was tuned to the best of
> my knowledge (shared memory size, vacuumed, etc.).  Lucene and
> the indexer were running on another machine.  The indexing took
> just under one minute, with the PostgreSQL server making its
> machine just about unusable during this period, perhaps writing
> join tables to hard disk and fetching them back later, while the
> Lucene machine was as happy as a machine can be.
> 
>   * The raw SQL dump of the data was 45 MiB; bzip2 would reduce
> it to 17 MiB.  The PostgreSQL database holding this data takes
> more than 70 MiB, not counting an indexing system on top of that.
> In Lucene, for each field you can select at index time whether
> you want the field to be stored in the database (to be returned
> at search time) or not.  I could have simply stored primary keys
> into my RDBMS, but decided to store the whole text in Lucene,
> since, with everything stored AND indexed, the database was a
> small 30 MiB file!!  And my search page didn't need to contact
> the RDBMS for search excerpts anymore.
> 
>   * For a small project like mine, which needs almost none of an
> RDBMS's glories --or, to be honest, it does need them, but the
> performance of the joins I'd like is not satisfying at all--, I
> may decide to move completely to Lucene.  It provides all I want,
> and at least fetching the number of rows is far cheaper than in
> PostgreSQL, for example.  (Don't argue about MySQL and the
> others; they barely have things like views, schemas, etc.)
> 
> 
> Cheers,
> --behdad
> http://behdad.org/
> _______________________________________________
> gnome-devel-list mailing list
> gnome-devel-list@gnome.org
> http://mail.gnome.org/mailman/listinfo/gnome-devel-list
-- 
Manuel Amador <rudd-o@amautacorp.com>
Amauta


