Re: Followup: opinions on Search services

On Wed, 2005-04-06 at 23:57 +0100, Jamie McCracken wrote:

> EG say I want all doc files created after 1 April 2005 that contain the 
> word "office"

You'd go straight to the text retrieval database for that.
File type, creation and/or modification date, file size and the
contained words are all needed by most indexing software anyway, so
they are already in the index.
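
As a rough illustration of answering that query from the text index
alone, here is a minimal sketch. It assumes a toy in-memory index in
which each file's type, creation date and size sit next to the word
postings; the class name TinyIndex, its methods and the sample files are
all hypothetical and not taken from any real indexing package.

```python
from datetime import date


class TinyIndex:
    """Hypothetical toy index: per-file metadata plus word postings."""

    def __init__(self):
        self.meta = {}       # file id -> (extension, created date, size in bytes)
        self.postings = {}   # word -> set of file ids containing that word

    def add(self, file_id, ext, created, size, text):
        self.meta[file_id] = (ext, created, size)
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(file_id)

    def search(self, word, ext=None, created_after=None):
        hits = []
        # Walk only the postings for this word, filtering on metadata as we go.
        for file_id in self.postings.get(word.lower(), ()):
            f_ext, f_created, _size = self.meta[file_id]
            if ext is not None and f_ext != ext:
                continue
            if created_after is not None and f_created <= created_after:
                continue
            hits.append(file_id)
        return sorted(hits)


index = TinyIndex()
index.add("a.doc", "doc", date(2005, 4, 3), 1200, "Office move schedule")
index.add("b.doc", "doc", date(2005, 3, 1), 900, "Old office plan")
index.add("c.txt", "txt", date(2005, 4, 5), 300, "office notes")
index.add("d.doc", "doc", date(2005, 4, 4), 700, "holiday rota")

result = index.search("office", ext="doc", created_after=date(2005, 4, 1))
print(result)
```

The metadata filter runs while the postings for "office" are being
read, so no second result set has to be fetched and matched afterwards.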

Metadata like "author's name", the document title, the LC subject
classification, associated keywords, the list of people allowed to
edit the document, might all be kept in a big slow clunky RDBMS :-)

> With an SQL DB (EG an embedded one like SQLite or Embedded Firebird) it's 
> trivial to do it all in SQL effortlessly and quickly. Conversely, your 
> method would involve getting *all* files which contain the word "office" 
> and then matching that list to the SQL result set - yuck! That couldn't 
> be more slow and inefficient!

Well, no, that's not true.  I can imagine many slower ways.  For example
let's make an Oracle VARCHAR column for "word" and another for
"filename" and...
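
For concreteness, the word/filename layout might look something like
the sketch below. It is only an illustration: it uses SQLite through
Python's sqlite3 module rather than Oracle, and the table names,
column names and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (filename TEXT PRIMARY KEY, ext TEXT, created TEXT);
    CREATE TABLE words (word TEXT, filename TEXT);
""")
conn.executemany("INSERT INTO files VALUES (?, ?, ?)", [
    ("a.doc", "doc", "2005-04-03"),
    ("b.doc", "doc", "2005-03-01"),
    ("c.txt", "txt", "2005-04-05"),
])
docs = {
    "a.doc": "office move schedule",
    "b.doc": "old office plan",
    "c.txt": "office notes",
}
# One row per word occurrence: the table grows with the total amount of text.
for filename, text in docs.items():
    conn.executemany("INSERT INTO words VALUES (?, ?)",
                     [(w, filename) for w in text.split()])

# ISO-format date strings compare correctly as plain text.
rows = conn.execute("""
    SELECT DISTINCT f.filename
    FROM files AS f JOIN words AS w ON w.filename = f.filename
    WHERE w.word = ? AND f.ext = ? AND f.created > ?
    ORDER BY f.filename
""", ("office", "doc", "2005-04-01")).fetchall()
matches = [r[0] for r in rows]
word_rows = conn.execute("SELECT COUNT(*) FROM words").fetchone()[0]
print(matches, word_rows)
```

The query is easy enough to write, but the words table holds a row for
every word of every document, which is where the size and speed
questions come from.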

If you think of SQL queries as easy for users to write, and you think
of RDBMSs as being the most efficient possible search engines, then
yes, you'll think of relational databases as being the right approach
for almost anything.

> btw, OS X Spotlight uses SQLite for both indexing and metadata and I 
> suspect Longhorn will do something similar with its SQL server as they 
> have now shit canned WinFS.

SQL Server has some hybrid technology inside -- I don't know about
SQLite here.  Unfortunately the next SQL Server (Yukon) looks like it
won't have XML Query support, although they've done a lot of the work
and demonstrated it in public.

Anyway, I'm not sure any of this really matters on this list, sorry.

It might be interesting to post the relative size of your index compared
to the document size, and recall/precision graphs if you have them,
and indexing speed, and memory behaviour given, say, 4 GBytes of text
to index.  People on Unix tend to expect services to be robust, so
some good tests are in order here.
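
A minimal sketch of how such figures could be collected is below. It
assumes a toy inverted index built in memory and uses the pickled size
as a stand-in for the on-disk index size; the function names and the
sample corpus are hypothetical, and a real test would run against the
actual indexer and a much larger body of text.

```python
import pickle
import time


def build_index(docs):
    """Build a toy inverted index: word -> sorted list of document ids."""
    postings = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            postings.setdefault(word, []).append(doc_id)
    return {w: sorted(ids) for w, ids in postings.items()}


def measure(docs):
    """Report text size, serialized index size, their ratio and build time."""
    text_bytes = sum(len(t.encode("utf-8")) for t in docs.values())
    start = time.perf_counter()
    index = build_index(docs)
    elapsed = time.perf_counter() - start
    index_bytes = len(pickle.dumps(index))
    return {
        "text_bytes": text_bytes,
        "index_bytes": index_bytes,
        "index_to_text_ratio": index_bytes / text_bytes,
        "seconds": elapsed,
    }


def precision_recall(returned, relevant):
    """Precision and recall of one query against a hand-judged answer set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


docs = {i: f"document {i} about office search indexing" for i in range(1000)}
stats = measure(docs)
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6])
print(stats)
print(precision, recall)
```

Memory behaviour is not covered here; it would need a separate
measurement while the indexer runs over the full data set.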


Liam Quin, W3C XML Activity Lead,
Pictures from old books:
IRC (chat) programs:
