Re: Building off Medusa
- From: "Manuel Amador (Rudd-O)" <amadorm usm edu ec>
- To: Seth Nickell <snickell stanford edu>
- Cc: sinzui cox net, desktop-devel-list gnome org, gnome-devel-list gnome org
- Subject: Re: Building off Medusa
- Date: Thu, 10 Apr 2003 09:02:23 -0500
> Just so you know... Incremental indexing won't be possible using libfam.
> FAM will not scale to monitoring over about 500 files, so you definitely
> will not be able to get change notification on all the files on a disk.
> I would love a way to register with the kernel to be notified whenever
> *any* file changes, but I don't believe there is such a mechanism.
Oh, but it does, and it does well (at least on Linux and Solaris). I
just got a command-line tool working that reports on stdout every file
that changes. You run it with a directory or file as its sole argument,
and it spits out the paths of all modified files, half a second after
they're modified.
FAM does *not* monitor *each* file. I don't know exactly how it works,
but it doesn't watch every file individually. If it did, I would agree
with you.
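To make that concrete, the core of such a tool boils down to something
like the sketch below (a simplified, single-directory illustration, not
the actual program; the real thing walks the tree and monitors every
directory it finds):

    /* watch.c: print the name of anything that changes in one directory.
     * Compile with: cc watch.c -o watch -lfam */
    #include <fam.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        FAMConnection fc;
        FAMRequest fr;
        FAMEvent fe;

        if (argc < 2 || FAMOpen(&fc) < 0)
            return 1;
        if (FAMMonitorDirectory(&fc, argv[1], &fr, NULL) < 0)
            return 1;

        /* Block until FAM delivers the next event, report file changes. */
        while (FAMNextEvent(&fc, &fe) >= 0) {
            if (fe.code == FAMChanged || fe.code == FAMCreated ||
                fe.code == FAMDeleted)
                printf("%s\n", fe.filename);
        }
        FAMClose(&fc);
        return 0;
    }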
> One concern I would have with using a system indexer is that only
> system-installed services will have their metadata indexed. For example,
> it means that if I install a new word processor in my account (or on my
> local machine, though in some system setups I wouldn't have system
> access to my local machine), my word processor documents on the NFS
> server (or local machine) will not get indexed with their special
> metadata.
As long as the server (NFS or local) has the indexing plugins installed,
it will index the documents, whether or not you have the application
installed on the client computer. We'd recommend that ISVs write
plugins which may use their application libraries but remain independent
of them. You would also be able to see some metadata (document title,
topics, etcetera) in the search interface.
We are definitely not planning to create a search service that is
per-user-installable. It will be an all-or-nothing proposition.
Remember we're aiming for the corporate customer with our enterprise
edition and the general Linux distros with the open-source release. We
will also try to have it enabled by default. This follows naturally
from the security and ease-of-use principles embedded in our product.
We also plan to be extra careful and have the OSS community audit the
code so we can nail all potential security problems ASAP.
> That would help, but as you say will only alleviate the problem.
Evidently. The other cornerstone is what we call "abort on faulty data,
no questions asked". Let me explain a bit:
The "root exposure" threat model is limited to three components:
1) The file monitor (indirectly via FAM)
2) The indexing plugins (directly to the files)
3) The indexing service (indirectly via the indexing plugins)
(we plan to make the search service a user-level process, although in
principle it will be written with the same security pragmas)
Any of those can be turned into a monster by faulty data. Now let's
analyze each of them. The file monitor gets its data via FAM, and FAM
only passes pathnames to it. Consequently, we have no choice but to
trust the operating system and the pathnames, because the OS is the
source of those pathnames. The only caveat is that we need to deal with
each pathname. Since minimal processing is involved (just matching it
against an exclusion list, that's all), the window of exposure is small
and the data is mostly trustworthy.
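In code, that pathname handling is about as small as it sounds; a rough
sketch (the exclusion patterns here are invented for illustration):

    #include <fnmatch.h>
    #include <stddef.h>

    /* Hypothetical exclusion list; the real one would be configurable. */
    static const char *excluded[] = { "/tmp/*", "/proc/*", "*.o", "*~", NULL };

    /* The only processing the file monitor does on a FAM pathname:
     * match it against the exclusion list before queueing it for indexing. */
    static int is_excluded(const char *path)
    {
        for (size_t i = 0; excluded[i] != NULL; i++)
            if (fnmatch(excluded[i], path, 0) == 0)
                return 1;   /* matched: skip this pathname */
        return 0;           /* not excluded: hand it to the indexer */
    }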
The indexing plugins are the ones with the largest window of exposure.
They will be directly exposed to faulty data. Consequently, each
plugin has the duty to reject anything that is not valid data.
Manipulating data is easier and less error-prone in high-level
languages, which also reduces our window of exposure. We know that,
even after all these security precautions are taken, it's possible for
a plugin to be compromised. We take that seriously. But I'm sure that
the value provided by our solution is much higher than the value lost
to the potential security threat, and that the threat can be
effectively controlled (at least for our corporate customers, for whom
we might provide managed updates through software distribution channels
and the like; OSS users will also have updates available). If we
weren't sure of this, we would simply dive into YAWVBA (yet another
Windows/Visual Basic application).
The indexer would also be written in a high-level language. Since the
indexer would only need to parse a standardized XML dialect, the parser
could simply throw an exception on faulty data, and the indexer would
abort processing that item immediately. No buffer overflows are
possible. We recognize that bad-logic-induced defects could still
appear in our product. We will be preemptive and proactive about them.
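As an illustration of "abort on faulty data" (and only an illustration:
the real indexer would be in a high-level language, where the parser
raises an exception instead), the equivalent in C with libxml2 would
look roughly like this:

    #include <libxml/parser.h>

    /* Parse a plugin's output strictly; return NULL and drop the item if
     * the data is not well-formed XML. No recovery, no questions asked. */
    static xmlDocPtr accept_plugin_output(const char *buf, int len)
    {
        xmlDocPtr doc = xmlReadMemory(buf, len, "plugin-output.xml", NULL,
                                      XML_PARSE_NONET | XML_PARSE_NOERROR);
        if (doc == NULL)
            return NULL;    /* faulty data: abort immediately */
        return doc;         /* caller indexes it, then calls xmlFreeDoc() */
    }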
> I doubt text-based XML messages will be a good option for communication
> between, say, a system indexer and specific indexing components. When
> you consider the number of document types that could be on a system,
> this could represent a major performance bottleneck.
Perhaps if we were to XML-ize full-text-indexing searches, I'd agree.
But in practice, data indexing plugins wouldn't pass XML data; only
metadata indexing plugins would. Why? Because there's less of a
performance impact there. The security, component-discreteness and
convenience advantages of parsing XML largely outstrip the performance
gain of a CORBA/COM/RPC/stdout-based solution. And indexing speed is
secondary; the primary thing for the user experience is searching speed.
> Doing completely live updated "indexed" searches for a whole filesystem
> requires a filesystem that supports it. None of the commonly deployed
> filesystems (that I know of) currently do. FAM, as I mentioned before,
> won't be able to do this.
I showed you that FAM can do this. Trust me, I was amazed, but it can.
It apparently doesn't monitor each file; instead it uses dnotify to
receive global events and delivers them as pathnames to the application.
On systems which support neither /dev/imon nor dnotify, we have a
serious problem (FAM would attempt to poll() or stat() *each* file,
which would most certainly kill the system).
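For anyone curious what the dnotify side looks like underneath FAM, here
is a rough sketch (illustration only; FAM hides all of this, and dnotify
only says *which directory* changed, so FAM works out the pathnames
itself):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t dir_changed = 0;

    static void on_change(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)si; (void)ctx;
        dir_changed = 1;            /* a watched directory changed */
    }

    int main(int argc, char **argv)
    {
        struct sigaction sa;
        int fd;

        if (argc < 2)
            return 1;

        sa.sa_sigaction = on_change;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGRTMIN, &sa, NULL);

        /* One watch per *directory*, not per file: that's why it scales. */
        fd = open(argv[1], O_RDONLY);
        fcntl(fd, F_SETSIG, SIGRTMIN);
        fcntl(fd, F_NOTIFY, DN_MODIFY | DN_CREATE | DN_DELETE | DN_MULTISHOT);

        for (;;) {
            pause();                /* sleep until the kernel signals us */
            if (dir_changed) {
                printf("something changed in %s\n", argv[1]);
                dir_changed = 0;
            }
        }
    }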
> Not on Unix, at least AFAIK. Disk activity is unfortunately not (to my
> knowledge) monitorable, nor does it seem to be adequately handled by
> having a low priority. As far as we could tell with Medusa, the indexer
> would hog the system with respect to disk activity no matter how low
> the priority was set.
Why? Let me answer my own question. Linux 2.4 allowed applications to
hog the disk even at low priority. But in Linux 2.6, the disk elevator
algorithms, combined with the low-latency work and anticipatory I/O
scheduling, *do* improve system responsiveness under heavy disk
throughput and prevent applications from hogging the disk transfer
bandwidth. This will make "nice -20 indexd" possible. I'm also sure
that Solaris can cope with this, too. That leaves the BSDs, and the
BSDs are champs at performance as well.
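Lowering the indexer's own priority is a one-liner anyway; something
like this (a sketch, and it only covers the CPU side; the disk side is
what the 2.6 elevator work above takes care of):

    #include <sys/resource.h>

    /* Demote the calling process (e.g. indexd) to the lowest CPU priority. */
    static int be_nice(void)
    {
        return setpriority(PRIO_PROCESS, 0, 19);   /* 19 = nicest on Linux */
    }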
Besides, it's our view that this kind of problem is the underlying
architecture's fault, and any effort to solve it should be directed at
the architecture itself, not at building a workaround. Systems thinking:
any system can be effectively changed; it's just a matter of finding the
leverage point.