Re: Building off Medusa




Just so you know... Incremental indexing won't be possible using libfam.
FAM will not scale to monitoring over about 500 files, so you definitely
will not be able to get change notification on all the files on a disk.
I would love a way to register with the kernel to be notified whenever
*any* file changes, but I don't believe there is such a mechanism.

Oh, but it does, and it does it well (at least on Linux and Solaris). I just got a command-line tool working which reports on stdout every file that changes. You run it with a directory or file as its sole argument, and it spits out the paths of all modified files, half a second after they're modified. FAM does *not* monitor *each* file. I don't know how it does it, but it doesn't watch every file. If it did, I would agree with you.
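Here's roughly what such a tool boils down to against the libfam client API (a sketch of the idea, not the exact code I ran; build with "gcc famwatch.c -o famwatch -lfam", assuming libfam is installed):

/* famwatch.c -- rough sketch of a FAM-based change reporter */
#include <stdio.h>
#include <fam.h>

int main(int argc, char **argv)
{
    FAMConnection fc;
    FAMRequest fr;
    FAMEvent fe;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    if (FAMOpen(&fc) < 0)
        return 1;                     /* famd not running */
    /* One request covers the whole directory; FAM decides underneath
     * whether to use dnotify, /dev/imon or polling.  A real tool would
     * also recurse and add one request per subdirectory. */
    if (FAMMonitorDirectory(&fc, argv[1], &fr, NULL) < 0)
        return 1;
    while (FAMNextEvent(&fc, &fe) > 0) {
        if (fe.code == FAMChanged || fe.code == FAMCreated)
            printf("%s\n", fe.filename);   /* name as reported by FAM */
        fflush(stdout);
    }
    FAMClose(&fc);
    return 0;
}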
One concern I would have using a system indexer is that only system
installed services will have their metadata indexed. For example, it
means that if I install a new Word processor in my account (or on my
local machine, though in some system setups I wouldn't have system
access to my local machine), my word processor documents on the NFS
server (or local machine) will not get indexed with their special
metadata.

As long as the server (NFS or local) has the indexing plugins installed, it will index the documents, whether or not you have the application installed on the client machine. We'd recommend that ISVs write plugins which may use their application's libraries but remain independent of them. You would also be able to see some metadata (document title, topics, etcetera) in the search interface.
We are definitely not planning to create a search service that is 
per-user-installable.  It will be an all-or-nothing proposition.  
Remember we're aiming for the corporate customer with our enterprise 
edition and the general Linux distro with the open-source release.  We 
will also attempt to have it be enabled by default.  This is a 
reasonable and sensible consequence of the concepts of security and ease 
of use embedded in our product.  We also plan to be extra careful and 
have the OSS community audit the code so we can nail all potential 
security problems ASAP.
That would help, but as you say will only alleviate the problem.

Evidently. The other cornerstone is what we call "abort on faulty data, no questions asked". Let me explain a bit:
The "root exposure" threat model is limited to three components:
1) The file monitor (indirectly via FAM)
2) The indexing plugins (directly to the files)
3) The indexing service (indirectly via the indexing plugins)
(we plan to make the search service a user-level process, although in principle it will be written with the same security pragmas)
Any of those can be turned into a monster by faulty data.  Now let's analyze 
each of them.  The file monitor gets its data via FAM.  FAM only passes 
pathnames to the file monitor.  Consequently, we have no choice but to 
trust the operating system, and the pathnames, because the OS is the 
source of those pathnames.  The only caveat is that we need to deal with 
every single pathname.  Since minimal processing is involved per pathname 
(only matching it against an exclusion list, that's all), the window of 
exposure is small, and the data is mostly trustable.
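To give an idea of how little processing that is, the check is essentially just one pattern match per pathname, along these lines (the patterns and the function name are made up for illustration):

/* Sketch of the exclusion check applied to every pathname FAM hands us. */
#include <fnmatch.h>
#include <stddef.h>

static const char *excluded[] = {
    "/proc/*", "/tmp/*", "*.o", "*~", NULL   /* hypothetical exclusion list */
};

/* Returns 1 if the pathname should be skipped, 0 if it should be indexed. */
int path_is_excluded(const char *path)
{
    size_t i;
    for (i = 0; excluded[i] != NULL; i++)
        if (fnmatch(excluded[i], path, 0) == 0)
            return 1;
    return 0;
}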
The indexing plugins are the ones with the largest window of exposure.  
They will be directly exposed to faulty data.  Consequently, each 
plugin has the duty to reject anything that is not valid data.  
Manipulating data is easier and less error-prone with high-level 
languages - that also reduces our window of exposure.  We know that, 
even after all these security precautions are taken, it's possible that 
a plugin will be compromised.  We take that seriously.  But I'm sure that 
the value provided by our solution is much higher than the value lost to 
the potential security threat, and that the security threat can be 
effectively controlled (at least for our corporate customers, for whom 
we might provide managed updates through software distribution channels 
and the like, but OSS users will also have updates available).  If we 
weren't sure of this, we would simply dive into YAWVBA (yet another 
Windows/Visual Basic application).
The indexer would also be written in a high-level language.  Since the 
indexer would only need to parse a standardized XML dialect, the parser 
could simply throw an exception on faulty data, and the indexer would 
drop the offending data immediately.  No buffer overflows are possible.
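In C the same fail-fast behaviour looks roughly like this with libxml2 (in a high-level language the parse failure would simply be an exception; the function name here is invented):

/* Sketch of "abort on faulty data": if a plugin's XML output doesn't
 * parse, the document is dropped and nothing else is attempted.
 * Build with: gcc check.c $(xml2-config --cflags --libs)
 */
#include <libxml/parser.h>

/* Returns the parsed metadata document, or NULL if the data was faulty. */
xmlDocPtr parse_metadata(const char *buf, int len)
{
    xmlDocPtr doc = xmlReadMemory(buf, len, "metadata.xml", NULL,
                                  XML_PARSE_NONET | XML_PARSE_NOERROR);
    if (doc == NULL)
        return NULL;   /* faulty data: drop it, no questions asked */
    return doc;        /* caller indexes it, then calls xmlFreeDoc(doc) */
}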
We recognize that bad-logic-induced defects could appear in our 
product.  We will be preemptive and proactive.
I doubt text-based XML messages will be a good option for communication
between say a system indexer and specific indexing components. When you
consider the number of document types that could be on a system, this
could represent a major performance bottleneck.

Perhaps if we were to XMLize full-text-indexing searches, I'd agree. But in practice, data indexing plugins wouldn't XMLize their data; only metadata indexing plugins would. Why? Because the performance impact there is far smaller. The security, component-discreteness and convenience advantages of parsing XML largely outstrip the performance gain of a CORBA/COM/RPC/stdout-based solution. And indexing speed is secondary; what matters primarily for the user experience is searching speed.
Doing completely live updated "indexed" searches for a whole filesystem
requires a filesystem that supports it. None of the commonly deployed
filesystems (that I know of) currently do. FAM, as I mentioned before,
won't be able to do this.

I showed you that FAM can do this. Trust me, I was amazed; it can. It apparently doesn't monitor each file, but uses dnotify to receive per-directory events and deliver them as pathnames to the application. On systems which support neither /dev/imon nor dnotify, we have a serious problem (FAM would fall back to poll()ing or stat()ing *each* file, which would most certainly kill the system).
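For reference, this is roughly the mechanism FAM leans on under Linux: dnotify registers one watch per *directory*, not per file, which is why it scales. A minimal sketch follows (illustrative only, not FAM's actual code):

/* Minimal dnotify sketch: get a signal whenever anything inside a
 * directory changes.  dnotify only says "this directory changed";
 * FAM then works out which entry it was and hands the application
 * a pathname. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t dir_changed = 0;

static void on_change(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si; (void)ctx;
    dir_changed = 1;
}

int main(int argc, char **argv)
{
    struct sigaction sa;
    int fd;

    if (argc != 2)
        return 1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_change;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGRTMIN, &sa, NULL);

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;
    fcntl(fd, F_SETSIG, SIGRTMIN);   /* use a realtime signal instead of SIGIO */
    fcntl(fd, F_NOTIFY, DN_MODIFY | DN_CREATE | DN_DELETE | DN_MULTISHOT);

    for (;;) {
        pause();                     /* one wakeup per directory event */
        if (dir_changed) {
            printf("something changed under %s\n", argv[1]);
            dir_changed = 0;
        }
    }
}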
Not on Unix, at least AFAIK. Disk activity is unfortunately not (to my
knowledge) monitorable, nor does it seem to be adequately handled by
having a low priority. As far as we could tell with medusa, the indexer
would hog the system wrt disk activity no matter how low the priority
was set.

Why? Let me answer my own question. Linux 2.4 allowed applications to hog the disk, even at low priority. But in Linux 2.6, the disk elevator algorithms combined with the low-latency efforts and the anticipatory scheduler and whatnot *do* improve system responsiveness under high disk throughput, and prevent applications from hogging the disk transfer bandwidth. This will make "nice -20 indexd" possible. I'm also sure that Solaris can cope with this. That leaves the BSDs, and the BSDs are champs in performance as well.
Besides, it's our view that this kind of problem is the underlying 
architecture's fault, and any efforts to solve it should be directed at 
the underlying architecture, not at building a workaround.  Systemic 
thought: any system can be effectively changed; it's just a matter of 
finding the leverage point.


