Re: Building off Medusa
- From: "Manuel Amador (Rudd-O)" <amadorm usm edu ec>
- To: Seth Nickell <snickell stanford edu>
- Cc: sinzui cox net, desktop-devel-list gnome org, gnome-devel-list gnome org
- Subject: Re: Building off Medusa
- Date: Thu, 03 Apr 2003 18:31:22 -0500
Seth Nickell wrote:
- There are important reasons why Medusa runs in user-space.
Security being one of them. But Medusa in its current incarnation simply
isn't scalable without a lot of sysadmin effort (which rules out broad
deployments and could hamper GNOME as a desktop platform).
Imagine a 50-client NFS server being simultaneously indexed by the
Medusas of every logged-in user on every client. Now imagine the
sysadmins chalking the resulting load up to Medusa/GNOME.
- If you are really determined to do a system daemon, Medusa was already
built to do this and that aspect of Medusa could be revived. If you have
further questions about why Medusa was steered down the path it's
currently on, I'd love to discuss this with you further.
Thanks =) The reasons for the scale-down were discussed a couple of
months ago.
- Do not underestimate the number of issues and the amount of work needed
to recreate something like Medusa. It seems simple on the surface, but
doing it well is very complex. I do not see even recreating Medusa as
feasible in the scope of a college project (even a big one).
We wouldn't be recreating Medusa. We see ourselves as integrators: we
would integrate existing software and write/adapt current user
interfaces to take advantage of it. I'm investigating Medusa as a
possibility for an index/search service. I'm investigating Xapian too.
The thing is, I would agree with you completely if we were to "redo
Medusa all over in C", but we won't. We don't have the manpower, and we
would fail the course.
Incidentally, the course is focused on providing business management
solutions. Redoing Medusa isn't one, so we need to provide a complete
end-user corporate solution *and* sell it to at least one customer.
- Medusa is a pretty clean codebase and would be relatively easy to
extend and change.
- If you choose not to use Medusa and can "deliver", I'd certainly be
in favour of modifying gnome-search-tool to use your system.
This comment makes me extremely happy. I don't want to displace Medusa
at all, though. At this point, I'd like to know what Medusa can do,
specifically:
* relevance scores for returned documents: crucial for sorting results
* full-text search using word stems: people don't really remember the
exact spelling of a word
* full-text search phonetically: ditto (see the sketch right after this
list for both)
* incremental live indexing: indexes that are up to date to the last
half minute
* multiuser indexing: so query results can filter out what a user isn't
allowed to see
* metadata indexing: to search files by "artist", "author" or "album"
(I'm listening to music =)
* offline searches: to provide "volume indexes" (search your 50 CD-ROMs
without having them mounted)
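
To make the stem and phonetic points concrete, here is a rough sketch
(Python, as an example of the kind of high-level language we'd use) of
how an indexer could store extra search keys per word. The suffix
stripping and the Soundex code are deliberately naive stand-ins; a real
implementation would use a proper stemmer (Porter, or whatever Xapian
ships with).

    # Sketch only: naive stemming + Soundex keys for index terms.
    SOUNDEX_CODES = {
        'b': '1', 'f': '1', 'p': '1', 'v': '1',
        'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2',
        's': '2', 'x': '2', 'z': '2',
        'd': '3', 't': '3', 'l': '4', 'm': '5', 'n': '5', 'r': '6',
    }

    def naive_stem(word):
        # Strip a few common English suffixes; nothing fancy.
        for suffix in ('ations', 'ation', 'ings', 'ing', 'ed', 'es', 's'):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def soundex(word):
        # Classic 4-character Soundex code (simplified: h/w rule omitted).
        word = word.lower()
        code = word[0].upper()
        previous = SOUNDEX_CODES.get(word[0], '')
        for letter in word[1:]:
            digit = SOUNDEX_CODES.get(letter, '')
            if digit and digit != previous:
                code += digit
            previous = digit
        return (code + '000')[:4]

    def index_keys(word):
        # Each word is indexed under its literal form, its stem and its
        # phonetic code, so "categories", "category" and "kategory" meet.
        word = word.lower()
        return {word, naive_stem(word), soundex(word)}

    print(index_keys('categories'))  # three keys: categories, categori, C326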
The incremental live indexing isn't difficult. Medusa could use libfam
to pick up modifications to files and reindex them. I have it pretty
well sorted out, although perhaps a "medusa-modifyd" is needed, which
puts filesystem change manifests in a FIFO queue; Medusa then reads the
queue and reindexes the affected files. This also helps with offline
searches.
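
A sketch of what that queueing piece could look like, with the FAM
hookup stubbed out (the real thing would feed file_changed() from libfam
events; "medusa-modifyd" and every name here is made up):

    # Sketch of the hypothetical "medusa-modifyd" queue: change
    # notifications are appended to a manifest, and the indexer drains
    # it a few seconds later.
    import threading
    import time

    class ChangeQueue:
        def __init__(self):
            self._pending = {}            # path -> time of last change
            self._lock = threading.Lock()

        def file_changed(self, path):
            # Called from the FAM event loop whenever a file is created,
            # modified, deleted or chmod'ed.
            with self._lock:
                self._pending[path] = time.time()

        def drain(self, settle_seconds=2.0):
            # Hand back every path whose last change is older than the
            # settle time, so rapid rewrites of one file index only once.
            now = time.time()
            ready = []
            with self._lock:
                for path, stamp in list(self._pending.items()):
                    if now - stamp >= settle_seconds:
                        ready.append(path)
                        del self._pending[path]
            return ready

    def reindex_loop(queue, reindex_file):
        # The indexer side: poll the queue, reindex whatever has settled.
        while True:
            for path in queue.drain():
                reindex_file(path)
            time.sleep(1)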
The multiuser thing is crucial in a business setting. The metadata
indexing is crucial for me (dammit, I want to find my MP3s).
But perhaps most interesting are the key enterprise features: relayed
queries and rewritten responses. Instead of having each desktop's
indexer index the NFS server, do NOT index it; let the NFS server's own
index handle it. Then, when the search service receives a query, it
relays the query to the NFS server. Finding out which volumes are
NFS-mounted is dead easy. For a remote client, the search service would
need to rewrite the path names in its responses (think of a Windows GUI
tool searching a Samba server) so the client can actually open the files.
This requires per-volume tracking: you need to keep track of volumes,
volume labels and files. That would also help the indexer avoid
reindexing a volume when it gets remounted somewhere else.
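
A sketch of the relaying/rewriting idea, with an invented mount table
(in practice it would come from /proc/mounts or getmntent and be kept
per volume):

    # Sketch: route queries to the server that indexes a volume, and
    # rewrite result paths into something the client can actually open.
    NFS_MOUNTS = {
        # local mount point -> (server, exported path); invented example
        '/net/docs': ('fileserver', '/export/docs'),
    }

    def route_query(path):
        # Answer locally, or relay to the server that indexes the volume?
        for mount_point, (server, _export) in NFS_MOUNTS.items():
            if path == mount_point or path.startswith(mount_point + '/'):
                return server
        return None            # None means: answer from the local index

    def rewrite_result(server_path, server_prefix, client_prefix):
        # Map a path as the server's indexer stored it into a path the
        # querying client can open (think UNC paths for a Samba client).
        if server_path.startswith(server_prefix):
            return client_prefix + server_path[len(server_prefix):]
        return server_path

    # /export/docs/contracts/acme.sxw -> \\fileserver\docs\contracts\acme.sxw
    print(rewrite_result('/export/docs/contracts/acme.sxw',
                         '/export/docs',
                         r'\\fileserver\docs').replace('/', '\\'))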
- I can appreciate not wanting to use C. My concern if you didn't use C
would be ensuring that a C API was provided so we could integrate it
with GNOME applications. C is, for the most part, the "common
denominator" language on *nix.
Well, you're right. But I assure you that we wouldn't have time to
write a C library to connect to the search service. We intend to define
an XML vocabulary and let clients build their queries in that vocabulary
and send them to the search service. We expect people to link up with
the search service in that fashion, and perhaps we would make freely
licensed example code available to ease that integration work. I think
GNOMErs and KDErs won't have a problem, since both platforms have XML
libraries. XML also gives us platform independence, extensibility,
backwards compatibility, and no per-language binding code to write.
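
Something like this, say, where every element name is invented for
illustration (the actual vocabulary is yet to be designed):

    # Sketch of a query in a hypothetical XML vocabulary; any client
    # with an XML library can emit this and ship it to the search service.
    import xml.etree.ElementTree as ET

    query = ET.Element('search-query', version='0.1')
    ET.SubElement(query, 'text', match='stemmed').text = 'quarterly report'
    ET.SubElement(query, 'metadata', key='author').text = 'Smith'
    scope = ET.SubElement(query, 'scope')
    ET.SubElement(scope, 'volume').text = 'file:///home'

    print(ET.tostring(query, encoding='unicode'))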
- 1) Medusa has not always been per-user (in fact, no released version of
Medusa has been per-user). My point is: a lot of work was done on Medusa
to verify that it was secure, make it work well as a system daemon
communicating with user processes, etc, and we still backed down from a
system daemon in the end after all that investment. Don't underestimate
the work involved on this point.
Definitely not. But as explained above, an enterprise knowledge-mining
solution can't work per-user. Think of an attorney looking for a
particular contract on the company's file server. Now think of 50
attorneys looking for different documents.
- 2) System indexes have a lot of scary security problems. You (or,
perhaps more pointedly, the Linux distributions you want to run your
indexer as "root") have to be confident that there is no way to crash or
confuse your indexer with user-created files, file structures, etc. This
becomes a particularly serious issue if you want to have lots of
indexing "plugins" (for example, to index the "metadata" from MP3s,
AbiWord documents, etc). Each of these plugins will need to meet that
level of security!
This can be alleviated:
* indexing plugins should be written in high-level, managed languages
(Python?). Exceptions should be caught and the offending plugin aborted.
* communication among components should use XML. That way the parsers
can throw exceptions and the communication can be aborted before any
damage is done.
I know what your fears are, and I fully share them. Malicious users
could inject malicious files. And if the indexing job were done in C,
I'd be scared shitless. But not so with managed languages: the whole
memory-corruption class of security bugs is slashed right there. After
that, there's the issue of plugins relaying malicious data to the
indexer, but if the communication is done in XML, malicious data would
just trigger a parse exception in the indexer, and the indexer would
mark the plugin as bad and keep on strolling.
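
As a sketch of that "mark it bad and keep on strolling" loop (the
extract() plugin interface here is hypothetical; a hardened setup would
also run each plugin in its own process):

    # Sketch: a misbehaving plugin only disables itself, never the indexer.
    import xml.etree.ElementTree as ET

    bad_plugins = set()

    def run_plugin(name, extract, path):
        if name in bad_plugins:
            return None
        try:
            xml_blob = extract(path)        # runs in the managed language
            return ET.fromstring(xml_blob)  # parse error -> exception
        except Exception as err:
            # Malicious or malformed input costs us one plugin, not the
            # whole indexing run.
            print('disabling plugin %s: %s' % (name, err))
            bad_plugins.add(name)
            return None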
- 3) While a user is logged in, it is highly desirable to index their data
much more frequently. This is easily accommodated with a user-space
daemon but requires tricky (though not impossible) games with a system
daemon.
I fully agree that data should be indexed quickly. But why only for
logged-in users? Why not for all of them? It's not that hard: a file
gets modified, and a couple of seconds later the indexer reindexes it,
all with the help of FAM and perhaps a separate application (a
file-monitor queueing service, which could also be a systemwide service;
no security risk in that, because it couldn't be polluted by malicious
data). The key here is that the indexer runs at the lowest scheduling
priority (niced all the way down), so there is no system performance
impact.
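
The priority drop itself is a one-liner; a sketch of what the indexer
process would do at startup (assuming a Python indexer):

    # Sketch: drop to the lowest scheduling priority at startup, so the
    # indexer only gets CPU the rest of the system isn't using.
    import os

    def become_background_process():
        try:
            os.nice(19)   # maximum niceness
        except OSError:
            pass          # already as nice as we're allowed to get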
- 4) You can't index as anything other than root because many interesting
user documents will not be world readable.
Exactly.
- 5) If you have a system index made as root, you need to implement a
search daemon that controls access to that information based on the
interested process's UID and the permissions relevant to each indexed
file. Also note that there can be discrepancies in security created
between permission changes and re-indexes, which could possibly be a
concern on some systems.
Yes. We are counting on needing to implement access-control capabilities
in the search daemon; Medusa already had that. As for the permission
changes, FAM solves that too: chmod on a file? Reindex the file's
metadata, and presto.
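
A sketch of how the search daemon could filter results against the
requester's credentials, using the ownership and mode bits recorded at
index time (it ignores directory permissions along the path, which a
real implementation would also have to check):

    # Sketch: only return results the querying user is allowed to read.
    import stat

    def user_may_read(st_mode, st_uid, st_gid, uid, gids):
        if uid == 0:
            return True
        if st_uid == uid:
            return bool(st_mode & stat.S_IRUSR)
        if st_gid in gids:
            return bool(st_mode & stat.S_IRGRP)
        return bool(st_mode & stat.S_IROTH)

    def filter_results(results, uid, gids):
        # 'results' are (path, st_mode, st_uid, st_gid) tuples from the
        # index; uid/gids come from the connecting client (e.g. via
        # SO_PEERCRED on a Unix socket).
        return [r for r in results
                if user_may_read(r[1], r[2], r[3], uid, gids)]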
- The current planned Medusa approach, under consideration, is as follows:
- Data in /home/username is frequently indexed by a user space daemon.
This is done while the user is logged in.
- A system index is performed as "nobody", allowing searches for files
and information that everyone has read access to (such as man pages,
documentation, etc).
Except for corporate information that is visible only to members of a
"management" group (fictional setting): the "nobody" index can't read
it, so management can't mine that data.
- GnomeVFS integration and incremental indexing mean that as soon as a
file is changed the user-space indexing daemon is notified and
re-indexes just that file.
- User space indexing means it is easy to get information on whether
the mouse and keyboard are in use (something that *was* done with the
system medusa indexer too, but was more tricky) and "back off" to
provide a responsive system.
You don't need to monitor for user activity. Merely setting a very low
priority makes for a responsive system. The Microsoft Indexing Service
follows this approach.
- Recently used documents (perhaps an extended version) allow the
medusa user-space indexing daemon to find new areas of the disk where
people keep files that the system indexer wasn't able to access. That
means that even if the files in /Music aren't readable by nobody, if you
access a file in /Music the user-space medusa will find that directory
and start indexing it. (This is a touchy point; may or may not be good,
hard to say.)
* we don't want a hundred PCs each indexing the NFS server. We want the
search service to delegate queries to NFS servers, so as to avoid
network load and wasted disk space
- Yes, very important. Medusa currently avoids indexing NFS mountpoints,
but doesn't do anything to solve the "searching nfs mounts" problem.
However, there's no reason medusa can't be extended to do this. It will
certainly be easier than starting from scratch.
* as there is no documentation, we don't know if Medusa can index
gigabytes of files, extract their data and metadata, and provide
less-than-10-second query response. Our initial prospect for the
database, PostgreSQL, can indeed provide less-than-10-second responses
for queries, provided the proper indexes are applied to the proper tables.
- It would be quite possible to port Medusa to using a database as a
backend, or to use database backends as an alternate source of
information. (BTW, you might consider looking at SQLite for the local
index case.)
PostgreSQL was our choice; it does full-text indexing. But Xapian is
shaping up as an amazing contender.
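
For the local index case you mention, a minimal SQLite sketch (the table
layout is invented for illustration); the point is that with the right
indexes the lookups stay fast:

    # Sketch: SQLite as a local index backend.
    import sqlite3

    db = sqlite3.connect('/tmp/medusa-index.db')
    db.executescript('''
        CREATE TABLE IF NOT EXISTS files (
            path    TEXT PRIMARY KEY,
            mtime   INTEGER,
            st_mode INTEGER, st_uid INTEGER, st_gid INTEGER
        );
        CREATE TABLE IF NOT EXISTS keywords (
            word TEXT, path TEXT REFERENCES files(path)
        );
        CREATE INDEX IF NOT EXISTS keywords_by_word ON keywords (word);
    ''')

    def lookup(word):
        # Return the paths indexed under a given keyword.
        return [row[0] for row in
                db.execute('SELECT path FROM keywords WHERE word = ?',
                           (word,))]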
- I'm assuming "enterprise-class" here is a euphemism for "networked".
plus sellable for lots of bucks.
=) luck.