Re: Building off Medusa



> >  - Do not underestimate the number of issues and the amount of work to
> >recreate something like Medusa. It seems simple on the surface, doing it
> >well is very complex. I do not see even recreating Medusa as feasible in
> >the scope of a college project (even a big one).
> >
> We wouldn't be recreating medusa.  We see ourselves as integrators.  We 
> would integrate existing software and write/adapt current user 
> interfaces to take advantage of this.  I'm investigating Medusa as a 
> possibility for an index/search service.  I'm investigating Xapian too.  
> The thing is, I agree completely with you, if we were to "redo medusa 
> all over in C".  but we won't.  We don't have the manpower and we would 
> fail the course.

Great. I was worried you were being unrealistic, but I can see now that
you are not.

> Incidentally, the course is focused on providing business management 
> solutions.  Doing a medusa isn't one, so we need to provide a complete 
> end-user corporate solution *and* sell it to at least one customer.

Right.

> The incremental live indexing isn't difficult.  Medusa could use libfam 
> to gather modifications to files, and reindex them.  I got it pretty 
> sorted out, although perhaps a "medusa-modifyd" is needed, which puts 
> filesystem change manifests in a FIFO queue, and medusa reads the queue 
> and reindexes the filesystem.  This also helps for offline searches.  
> The multiuser thing is crucial to a business setting.  The metadata 
> indexing is crucial for me (damnit i want to find my MP3).

Just so you know... Incremental indexing won't be possible using libfam.
FAM will not scale to monitoring over about 500 files, so you definitely
will not be able to get change notification on all the files on a disk.
I would love a way to register with the kernel to be notified whenever
*any* file changes, but I don't believe there is such a mechanism.
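
For what it's worth, the queueing half of the "medusa-modifyd" idea can be
sketched without FAM at all. The version below is only an illustration
under stated assumptions: it finds changes by comparing two polled
snapshots of modification times (a swapped-in technique, not change
notification), and the names snapshot, enqueue_changes and drain are
hypothetical, not part of Medusa.

```python
import os
import queue

def snapshot(root):
    """Map each file path under root to its modification time in nanoseconds."""
    mtimes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.stat(path).st_mtime_ns
            except OSError:
                pass  # file vanished between listing and stat
    return mtimes

def enqueue_changes(old, new, changes):
    """Put one (kind, path) manifest entry per difference onto the FIFO queue."""
    for path in sorted(new):
        if path not in old:
            changes.put(("added", path))
        elif new[path] != old[path]:
            changes.put(("modified", path))
    for path in sorted(old):
        if path not in new:
            changes.put(("removed", path))

def drain(changes, index):
    """Indexer side: apply queued manifests to a toy in-memory index."""
    while True:
        try:
            kind, path = changes.get_nowait()
        except queue.Empty:
            break
        if kind == "removed":
            index.pop(path, None)
        else:
            # Assumption: file size stands in for real content indexing.
            index[path] = os.path.getsize(path)
```

Polling like this walks the whole tree on every pass, so it trades
notification latency for disk activity; it does not make the scaling
problem go away.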

One concern I would have with a system indexer is that only metadata
handled by system-installed services will get indexed. For example, if I
install a new word processor in my account (or on my local machine,
though in some setups I wouldn't have system access to my local
machine), my word processor documents on the NFS server (or local
machine) will not get indexed with their special metadata.

> >2) System indexes have a lot of scary security problems. You (or,
> >perhaps more pointedly, the Linux distributions you want to run your
> >indexer as "root") have to be confident that there is no way to crash or
> >confuse your indexer from user created files, file structures, etc. This
> >becomes a particularly serious issue if you want to have lots of
> >indexing "plugins" (for example, index the "metadata" from MP3s, AbiWord
> >documents, etc). Each of these plugins will need to meet that level of
> >security!
> >
> This can be alleviated:
> * indexing plugins should be written in high-level, managed languages 
> (python?).  Exceptions should be caught and the program aborted.

That would help, but as you say will only alleviate the problem.
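
A minimal sketch of the "catch exceptions and abort" idea, assuming a
plugin is just a Python callable that takes a path and returns a dict of
metadata (that interface and the name run_plugin are made up for
illustration):

```python
def run_plugin(plugin, path):
    """Call a metadata plugin; return its dict, or None if it misbehaves."""
    try:
        result = plugin(path)
    except Exception:
        # The plugin choked on the file: drop this file's metadata
        # rather than let the error take down the indexer.
        return None
    if not isinstance(result, dict):
        # Assumption: plugins must return a dict; anything else is rejected.
        return None
    return result
```

This only covers plugins that fail by raising. A plugin that hangs or
eats all available memory on a hostile file is not stopped by a try
block, which is part of why this alleviates the problem rather than
solving it.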

> * communication among components should use XML.  That way the parsers 
> can throw exceptions and the communication can be aborted before any 
> damage is done.

I doubt text-based XML messages will be a good option for communication
between, say, a system indexer and specific indexing components. When you
consider the number of document types that could be on a system, this
could represent a major performance bottleneck.
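
To make the proposal concrete, here is roughly what "the parser throws
and the communication is aborted" could look like with Python's standard
xml.etree.ElementTree module. The message layout (a metadata element
holding named field elements) is purely an assumption for the sketch,
not anything Medusa defines.

```python
import xml.etree.ElementTree as ET

def read_message(text):
    """Parse one metadata message; return a dict, or None if it is rejected."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return None  # malformed XML: abort before using any of it
    if root.tag != "metadata":
        return None  # hypothetical format: the root element must be <metadata>
    # Collect each <field name="...">value</field> child into a dict.
    return {field.get("name"): field.text or "" for field in root.findall("field")}
```

Every message pays for a full text parse before a single field can be
used, which is where the performance worry comes from.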

> I fully agree that data should be indexed quickly.  But why for logged 
> users?  Why not for all of them?  It's not that hard.   Files modified, 
> and a couple of seconds later the index reindexes them, all with the 
> help of FAM and perhaps a separate application (a file monitor queueing 
> service, which could also be a systemwide service, no security risk in 
> that because it couldn't be polluted by malicious data).  Key here is 
> that the index runs with nice -20, so no system performance impact.

Doing completely live updated "indexed" searches for a whole filesystem
requires a filesystem that supports it. None of the commonly deployed
filesystems (that I know of) currently do. FAM, as I mentioned before,
won't be able to do this.

> >  - User space indexing means it is easy to get information on whether
> >the mouse and keyboard are in use (something that *was* done with the
> >system medusa indexer too, but was more tricky) and "back off" to
> >provide a responsive system.
> >
> You don't need to monitor for user activity.  Merely setting a very low 
> priority makes for a responsive system.  The Microsoft Indexing service 
> follows this approach.

Not on Unix, at least AFAIK. Disk activity is unfortunately not (to my
knowledge) monitorable, nor does it seem to be adequately handled by
having a low priority. As far as we could tell with medusa, the indexer
would hog the system wrt disk activity no matter how low the priority
was set.
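
The "back off" behaviour can be sketched as a loop that pauses whenever
the user is busy. This is only an illustration: user_is_active is a
hypothetical callback (how to actually detect mouse and keyboard use is
left out), and index_one stands in for whatever indexes a single file.

```python
import time

def index_with_backoff(paths, index_one, user_is_active,
                       pause=0.5, sleep=time.sleep):
    """Index each path, sleeping while the user is active; return the count."""
    indexed = 0
    for path in paths:
        # Yield the disk to the user: wait until they go idle again.
        while user_is_active():
            sleep(pause)
        index_one(path)
        indexed += 1
    return indexed
```

The check happens between files, so one large file can still tie up the
disk for a while once indexing of it has started.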

> >I'm assuming "enterprise-class" here is a euphemism for "networked".
> >
> plus sellable for lots of bucks.

*grin*

-Seth



