Re: Building off Medusa





Just so you know... Incremental indexing won't be possible using libfam.
FAM will not scale to monitoring over about 500 files, so you definitely
will not be able to get change notification on all the files on a disk.
I would love a way to register with the kernel to be notified whenever
*any* file changes, but I don't believe there is such a mechanism.

Oh, but it does, and it does it well (at least on Linux and Solaris). I just put together a command-line tool which reports every changed file on stdout. You run it with a directory or file as the sole argument, and it spits out the paths of all modified files within half a second of the modification. FAM does *not* monitor *each* file. I don't know exactly how it does it, but it doesn't watch every file. If it did, I would agree with you.
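To give you an idea, here is a rough sketch of what such a tool looks like against the libfam C API. I'm writing this from memory as an illustration, it is not the actual tool, so take the details with a grain of salt:

/* watchdir.c -- minimal sketch of a FAM-based change reporter.
 * Illustrative only; build with something like: cc -o watchdir watchdir.c -lfam
 */
#include <stdio.h>
#include <fam.h>

int main(int argc, char *argv[])
{
    FAMConnection fc;
    FAMRequest fr;
    FAMEvent fe;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    if (FAMOpen(&fc) < 0) {                 /* connect to the famd daemon */
        fprintf(stderr, "FAMOpen failed\n");
        return 1;
    }
    /* Ask FAM to watch the directory; FAM itself decides whether to use
     * dnotify, /dev/imon or polling underneath. */
    if (FAMMonitorDirectory(&fc, argv[1], &fr, NULL) < 0) {
        fprintf(stderr, "FAMMonitorDirectory failed\n");
        return 1;
    }
    for (;;) {
        if (FAMNextEvent(&fc, &fe) < 0)     /* blocks until an event arrives */
            break;
        switch (fe.code) {
        case FAMChanged:
        case FAMCreated:
        case FAMDeleted:
            /* fe.filename is the changed entry, relative to the monitored
             * directory for directory monitors. */
            printf("%s\n", fe.filename);
            fflush(stdout);
            break;
        default:
            break;                          /* ignore FAMExists, FAMEndExist, ... */
        }
    }
    FAMClose(&fc);
    return 0;
}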

One concern I would have using a system indexer is that only system
installed services will have their metadata indexed. For example, it
means that if I install a new Word processor in my account (or on my
local machine, though in some system setups I wouldn't have system
access to my local machine), my word processor documents on the NFS
server (or local machine) will not get indexed with their special
metadata.

As long as the server (NFS or local) has the indexing plugins installed, it will index the documents, whether or not you have the application installed on the client machine. We'd recommend that ISVs write plugins which may draw on their application libraries, but keep the plugins independent of the application itself. You would also be able to see some metadata (document title, topics, etcetera) in the search interface.

We are definitely not planning to make the search service per-user installable; it will be an all-or-nothing proposition. Remember we're aiming at the corporate customer with our enterprise edition and at the general Linux distributions with the open-source release. We will also try to have it enabled by default, which follows naturally from the security and ease-of-use principles embedded in our product. We also plan to be extra careful and have the OSS community audit the code, so we can nail all potential security problems ASAP.

That would help, but as you say will only alleviate the problem.

Evidently. The other cornerstone is what we call "abort on faulty data, no questions asked". Let me explain a bit:

The "root exposure" threat model is limited to three components:
1) The file monitor (indirectly via FAM)
2) The indexing plugins (directly to the files)
3) The indexing service (indirectly via the indexing plugins)
(we plan to make the search service a user-level process, although in principle it will be written with the same security pragmas)

Any of those can be turned into a monster by faulty data, so let's analyze each of them. The file monitor gets its data via FAM, and FAM only passes pathnames to it. Consequently we have no choice but to trust the operating system and the pathnames, because the OS is the source of those pathnames. The only caveat is that we still have to handle every file event. Since minimal processing is involved for each pathname (only matching it against an exclusion list, that's all), the window of exposure is small and the data is mostly trustworthy.
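To make the "minimal processing" point concrete, the check would amount to something like the following. The pattern list and helper name here are made up purely for illustration, not taken from the actual product:

/* Hypothetical sketch of the file monitor's pathname filter. */
#include <fnmatch.h>
#include <stddef.h>

static const char *excluded_patterns[] = {
    "/proc/*", "/sys/*", "/tmp/*", "*.o", "*~", NULL
};

/* Return nonzero if the pathname reported by FAM should be skipped.
 * Flags are 0, so '*' is allowed to match across '/' as well. */
int is_excluded(const char *path)
{
    size_t i;
    for (i = 0; excluded_patterns[i] != NULL; i++) {
        if (fnmatch(excluded_patterns[i], path, 0) == 0)
            return 1;
    }
    return 0;
}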

The indexing plugins are the ones with the largest window of exposure: they will be directly exposed to faulty data. Consequently, each plugin has the duty to reject anything that is not valid data. Manipulating data is easier and less error-prone in high-level languages, which also reduces our window of exposure. We know that, even after all these security precautions are taken, it's possible for a plugin to be compromised, and we take that seriously. But I'm sure that the value provided by our solution is much higher than the value lost to the potential security threat, and that the threat can be effectively controlled (at least for our corporate customers, for whom we might provide managed updates through software distribution channels and the like, though OSS users will also have updates available). If we weren't sure of this, we would simply dive into YAWVBA (yet another Windows/Visual Basic application).
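A hypothetical plugin skeleton, to show what "reject anything that is not valid data" means in practice. The format, magic bytes and names are invented for illustration; the point is "reject first, parse later":

/* Hypothetical entry point of an indexing plugin. */
#include <string.h>
#include <stddef.h>

#define EXAMPLE_MAGIC     "EXDOC\x01"
#define EXAMPLE_MAGIC_LEN 6
#define EXAMPLE_MAX_SIZE  (64 * 1024 * 1024)    /* refuse absurdly large input */

/* Returns 0 on success, -1 if the data is not something we understand. */
int example_plugin_index(const unsigned char *data, size_t len)
{
    if (data == NULL || len < EXAMPLE_MAGIC_LEN || len > EXAMPLE_MAX_SIZE)
        return -1;                              /* abort, no questions asked */
    if (memcmp(data, EXAMPLE_MAGIC, EXAMPLE_MAGIC_LEN) != 0)
        return -1;                              /* not our format */
    /* ... only now start parsing, still validating every field ... */
    return 0;
}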

The indexer would also be written in a high-level language. Since the indexer only needs to parse a standardized XML dialect, the parser can simply throw an exception on faulty data, and the indexer will drop the message immediately. No buffer overflows are possible there. We recognize that defects caused by bad logic could still appear in our product; we will be preemptive and proactive about them.
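As a sketch of the "abort on faulty data" behaviour, here it is with expat standing in as the parser. This is not necessarily what we'll ship, just an illustration of dropping a message the moment the parser complains:

/* Build with -lexpat. Returns 0 if the metadata message parsed cleanly,
 * -1 if it was dropped. */
#include <stdio.h>
#include <expat.h>

int index_metadata_message(const char *buf, int len)
{
    XML_Parser p = XML_ParserCreate(NULL);
    if (p == NULL)
        return -1;
    if (XML_Parse(p, buf, len, 1) == XML_STATUS_ERROR) {
        fprintf(stderr, "dropping message: %s\n",
                XML_ErrorString(XML_GetErrorCode(p)));
        XML_ParserFree(p);
        return -1;                      /* faulty data: abort immediately */
    }
    /* ... element handlers would have fed the index by now ... */
    XML_ParserFree(p);
    return 0;
}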

I doubt text-based XML messages will be a good option for communication
between say a system indexer and specific indexing components. When you
consider the number of document types that could be on a system, this
could represent a major performance bottleneck.

Perhaps if we were to XML-ize full-text-indexing searches, I'd agree. But in practice, data indexing plugins wouldn't pass data as XML; only metadata indexing plugins would. Why? Because metadata is small, so the performance impact is minor. The security, component-discreteness and convenience advantages of parsing XML largely outstrip the performance gain of a CORBA/COM/RPC/stdout-based solution. And indexing speed is secondary; what matters most for the user experience is search speed.

Doing completely live updated "indexed" searches for a whole filesystem
requires a filesystem that supports it. None of the commonly deployed
filesystems (that I know of) currently do. FAM, as I mentioned before,
won't be able to do this.

I showed you that FAM can do this. Trust me, I was amazed, but it can. It apparently doesn't monitor each file; it uses dnotify to receive per-directory events and delivers them as pathnames to the application. On systems which support neither /dev/imon nor dnotify we have a serious problem: FAM would fall back to poll()ing or stat()ing *each* file, which would most certainly kill the system.
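For reference, the dnotify mechanism FAM leans on looks roughly like this (heavily trimmed, no error handling, purely illustrative). dnotify only tells you *which directory* changed via a signal; FAM then works out the affected entries and hands pathnames upward:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t changed_fd = -1;

static void on_change(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    changed_fd = si->si_fd;            /* the directory that changed */
}

int main(int argc, char *argv[])
{
    struct sigaction sa;
    int fd;

    if (argc != 2)
        return 1;

    sa.sa_sigaction = on_change;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGRTMIN, &sa, NULL);

    fd = open(argv[1], O_RDONLY);      /* a directory, not a file */
    fcntl(fd, F_SETSIG, SIGRTMIN);     /* deliver a real-time signal */
    fcntl(fd, F_NOTIFY, DN_MODIFY | DN_CREATE | DN_DELETE | DN_MULTISHOT);

    for (;;) {
        pause();                       /* wait for the next notification */
        if (changed_fd == fd)
            printf("something changed in %s\n", argv[1]);
    }
}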

Not on Unix, at least AFAIK. Disk activity is unfortunately not (to my
knowledge) monitorable, nor does it seem to be adequately handled by
having a low priority. As far as we could tell with medusa, the indexer
would hog the system wrt disk activity no matter how low the priority
was set.

Why? Let me answer my own question. Linux 2.4 allowed applications to hog the disk even when running at low priority. But in Linux 2.6, the disk elevator algorithms, combined with the low-latency work and the anticipatory I/O scheduling, *do* improve system responsiveness under heavy disk throughput and keep a single application from hogging the disk bandwidth. That will make "nice -20 indexd" feasible. I'm also sure that Solaris can cope with this. That leaves the BSDs, and the BSDs are champions at performance as well.

Besides, our view is that this kind of problem is the architecture's fault, and any effort to solve it should be directed at the underlying architecture, not at building a workaround. Systemic thinking: any system can be effectively changed; it's just a matter of finding the leverage point.



