Building off Medusa



Summary:

  - Dual-licensing is the only Medusa issue that I don't think can be
resolved.
  - There are important reasons why Medusa runs in user-space. If you
are really determined to do a system daemon, Medusa was originally
built to do this, and that aspect of it could be revived. If you have
further questions about why Medusa was steered down the path it's
currently on, I'd love to discuss this with you further.
  - Do not underestimate the number of issues and the amount of work
involved in recreating something like Medusa. It seems simple on the
surface, but doing it well is very complex. I do not see even
recreating Medusa as feasible within the scope of a college project
(even a big one).
  - Medusa is a pretty clean codebase and would be relatively easy to
extend and change.
  - If you choose not to use Medusa and can "deliver", I'd certainly be
in favour of modifying gnome-search-tool to use your system.

> * we couldn't dual-license products based on medusa - GPL.  we do intend 
> to GPL our work, but we will dual-license it as well, à la Qt.

That would be a serious issue for you in using Medusa, I agree. Most of
the Medusa IP is not owned by a specific set of programmers, but by a
dead company... who knows who owns the IP now. That means there's no
possibility of relicensing Medusa (and it probably wouldn't be looked
upon favorably anyway).

> * documentation for medusa is ZERO

Medusa isn't documented, but I would be happy to help you understand how
the code works. It *is* very well-written code, though it sadly follows
the sorry Unix/C/GNOME tradition of not commenting the code much.

> * it's written in C, making development slow and making it hard to get 
> people around here to work on it

I can appreciate not wanting to use C. My concern, if you didn't use C,
would be ensuring that a C API is provided so we could integrate it
with GNOME applications. C is, for the most part, the "common
denominator" language on *nix.

> * the implementation is per-user, instead of being per-system.  that 
> means several medusa indexers and several indexes, instead of one master 
> index.

1) Medusa has not always been per-user (in fact, no released version of
Medusa has been per-user). My point is: a lot of work was done on Medusa
to verify that it was secure, make it work well as a system daemon
communicating with user processes, etc., and we still backed down from a
system daemon in the end, after all that investment. Don't underestimate
the work involved on this point.

2) System indexes have a lot of scary security problems. You (or,
perhaps more pointedly, the Linux distributions you want to run your
indexer as "root") have to be confident that there is no way to crash or
confuse your indexer with user-created files, file structures, etc. This
becomes a particularly serious issue if you want to have lots of
indexing "plugins" (for example, to index the "metadata" from MP3s,
AbiWord documents, etc). Each of these plugins will need to meet that
level of security!
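
To make that concern concrete, a metadata plugin interface tends to
look something like this (hypothetical names; the point is that every
parse_file implementation is handed untrusted bytes while running with
the indexer's privileges):

    /* Hypothetical plugin interface -- illustrative only. */
    typedef struct {
        const char *mime_type;      /* e.g. "audio/mpeg" */

        /* Parse an UNTRUSTED file and report key/value metadata.
         * If the indexer runs as root, a buffer overflow in any
         * plugin's parser is a root exploit. */
        int (*parse_file) (const char *path,
                           void (*add_pair) (const char *key,
                                             const char *value));
    } MetadataPlugin;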

3) While a user is logged in, it is highly desirable to index their data
much more frequently. This is easily accommodated with a user-space
daemon but requires tricky (though not impossible) games with a system
daemon.

4) You can't index as anything other than root, because many interesting
user documents will not be world-readable.

5) If you have a system index made as root, you need to implement a
search daemon that controls access to that information based on the
requesting process's UID and the permissions relevant to each indexed
file. Also note that security discrepancies can arise in the window
between a permission change and the next re-index, which could be a
concern on some systems.
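
Here is a rough sketch of the per-hit check such a daemon has to make,
assuming it stat()s the file at query time rather than trusting the
index (simplified: it ignores supplementary groups, ACLs, and the
directory-traversal permissions a real check would also need):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Return non-zero if a process with this uid/gid may read the
     * file described by 'st'. */
    static int
    may_read (const struct stat *st, uid_t uid, gid_t gid)
    {
        if (uid == 0)
            return 1;
        if (st->st_uid == uid)
            return (st->st_mode & S_IRUSR) != 0;
        if (st->st_gid == gid)
            return (st->st_mode & S_IRGRP) != 0;
        return (st->st_mode & S_IROTH) != 0;
    }

    int
    main (int argc, char **argv)
    {
        struct stat st;

        if (argc > 1 && stat (argv[1], &st) == 0)
            printf ("%s\n", may_read (&st, getuid (), getgid ())
                            ? "readable" : "not readable");
        return 0;
    }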

The Medusa approach currently under consideration is as follows:

  - Data in /home/username is frequently indexed by a user space daemon.
This is done while the user is logged in.
  - A system index is performed as "nobody", allowing searches for files
and information that everyone has read access to (such as man pages,
documentation, etc).
  - GnomeVFS integration and incremental indexing mean that as soon as
a file is changed, the user-space indexing daemon is notified and
re-indexes just that file (a sketch of this follows the list).
  - User space indexing means it is easy to get information on whether
the mouse and keyboard are in use (something that *was* done with the
system medusa indexer too, but was more tricky) and "back off" to
provide a responsive system.
  - The recently-used-documents list (perhaps an extended version)
allows the medusa user-space indexing daemon to find new areas of the
disk where people keep files that the system indexer wasn't able to
access. That means that even if the files in /Music aren't readable by
"nobody", if you access a file in /Music the user-space medusa will find
that directory and start indexing it. (This is a touchy point; it may or
may not be a good idea, hard to say.)
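
Here is the promised sketch of the incremental-indexing glue, using the
GnomeVFS monitor API (reindex_single_file is a hypothetical hook into
the indexer):

    #include <libgnomevfs/gnome-vfs.h>

    /* Hypothetical hook into the indexer. */
    static void
    reindex_single_file (const gchar *uri)
    {
        g_print ("re-indexing %s\n", uri);
    }

    /* Called by GnomeVFS when something under the monitored
     * directory changes; only info_uri needs re-indexing. */
    static void
    file_changed (GnomeVFSMonitorHandle    *handle,
                  const gchar              *monitor_uri,
                  const gchar              *info_uri,
                  GnomeVFSMonitorEventType  event_type,
                  gpointer                  user_data)
    {
        if (event_type == GNOME_VFS_MONITOR_EVENT_CHANGED ||
            event_type == GNOME_VFS_MONITOR_EVENT_CREATED)
            reindex_single_file (info_uri);
    }

    int
    main (void)
    {
        GnomeVFSMonitorHandle *handle;

        gnome_vfs_init ();
        gnome_vfs_monitor_add (&handle, "file:///home/username",
                               GNOME_VFS_MONITOR_DIRECTORY,
                               file_changed, NULL);

        /* The GLib main loop delivers the monitor callbacks. */
        g_main_loop_run (g_main_loop_new (NULL, FALSE));
        return 0;
    }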

> * we don't want a hundred PCs indexing the NFS server each.  we want the 
> search service to delegate queries to NFS servers, so as to avoid 
> network load and wasted disk space

Yes, very important. Medusa currently avoids indexing NFS mountpoints,
but doesn't do anything to solve the "searching NFS mounts" problem.
However, there's no reason medusa can't be extended to do this. It will
certainly be easier than starting from scratch.
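
For reference, the "avoid indexing NFS mountpoints" part is a simple
mount-table check; a Linux sketch (with simplified prefix matching):

    #include <stdio.h>
    #include <string.h>
    #include <mntent.h>

    /* Return non-zero if 'path' lies on an NFS mount -- roughly what
     * an indexer does to skip (or, one day, delegate to) network
     * filesystems. */
    static int
    is_on_nfs (const char *path)
    {
        FILE *mtab = setmntent ("/etc/mtab", "r");
        struct mntent *ent;
        size_t best_len = 0;
        int nfs = 0;

        if (mtab == NULL)
            return 0;
        while ((ent = getmntent (mtab)) != NULL) {
            size_t len = strlen (ent->mnt_dir);
            /* Longest mount point prefixing 'path' wins. */
            if (strncmp (path, ent->mnt_dir, len) == 0 && len >= best_len) {
                best_len = len;
                nfs = (strcmp (ent->mnt_type, "nfs") == 0);
            }
        }
        endmntent (mtab);
        return nfs;
    }

    int
    main (int argc, char **argv)
    {
        if (argc > 1)
            printf ("%s is %s\n", argv[1],
                    is_on_nfs (argv[1]) ? "on NFS" : "local");
        return 0;
    }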

> * as there is no documentation, we don't know if Medusa can index 
> gigabytes of files, extract their data and metadata, and provide 
> less-than-10 second query response.  Our initial prospect for the 
> database, PostgreSQL, can indeed provide less-than-10 second response 
> for queries, provided the proper indexes are applied to the proper tables.

It would be quite possible to port Medusa to use a database as a
backend, or to use database backends as an alternate source of
information. (BTW, you might consider looking at SQLite for the
local-index case.)
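
To show how small that glue layer could be, here is a toy sketch
against SQLite's C API (the schema is made up for illustration, and a
real full-text index wouldn't use LIKE, which can't use the database's
indexes):

    #include <stdio.h>
    #include <sqlite3.h>

    int
    main (void)
    {
        sqlite3 *db;
        sqlite3_stmt *stmt;

        sqlite3_open ("index.db", &db);
        sqlite3_exec (db,
                      "CREATE TABLE IF NOT EXISTS files "
                      "(uri TEXT PRIMARY KEY, words TEXT)",
                      NULL, NULL, NULL);

        /* Find files whose word list mentions "medusa". */
        sqlite3_prepare_v2 (db,
                            "SELECT uri FROM files WHERE words LIKE ?",
                            -1, &stmt, NULL);
        sqlite3_bind_text (stmt, 1, "%medusa%", -1, SQLITE_STATIC);
        while (sqlite3_step (stmt) == SQLITE_ROW)
            printf ("%s\n", (const char *) sqlite3_column_text (stmt, 0));
        sqlite3_finalize (stmt);
        sqlite3_close (db);
        return 0;
    }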

> But if you could help me work through these issues, we would be glad 
> (after all, we'd be saving work) to do this. 
> 
> Trust me, what we want to do is much bigger than just medusa.  We want 
> to bring enterprise-class full text indexing and search to Linux, *and* 
> open-source it.  We also will be looking into data mining, to provide 
> document maps and the like.  This all when the basic technology is ready.

I'm assuming "enterprise-class" here is a euphamism for "networked".
Actually, my biggest concern with your project if you started from
scratch would be two other enterprise concerns: security and
reliability.

Medusa *does* do full text indexing at the moment, but it has no concept
of "other medusa indexes". This would be a substantial project, but MUCH
MUCH less than doing it from scratch.

I don't want to discourage you, and I don't know about your abilities,
etc., but I would be very surprised if it were even possible to achieve
Medusa's level of stability and features in the scope of a college
project (even a very big one). There are a lot of subtle (and tricky)
issues within Medusa that have been addressed. Do not underestimate the
work involved. BUT, there is a lot of room for improvement of Medusa and
I think you could add many of the important features that you think are
missing without too much trouble.

All this said, if you want to go ahead with doing a project "from
scratch", and you do create a project that is faster than medusa, at
least as reliable and secure, and has the nice extra features you
propose, I would certainly be in favour of modifying gnome-search-tool
to use it. Note that "secure" means it would have to be secure enough
that you can convince Linux distributions and Solaris to ship it. That
will be a *major* issue if you have your indexer run as root, so be
warned!

-Seth



