Re: Building off Medusa





Just so you know... Incremental indexing won't be possible using libfam.
FAM will not scale to monitoring over about 500 files, so you definitely
will not be able to get change notification on all the files on a disk.
I would love a way to register with the kernel to be notified whenever
*any* file changes, but I don't believe there is such a mechanism.

Oh, but it does, and it does it well (at least on Linux and Solaris). I just put together a command-line tool which reports every changed file on stdout. You run it with a directory or file as the sole argument, and it spits out the paths of all modified files within half a second of the modification. FAM does *not* monitor *each* file. I don't know exactly how it does it, but it doesn't watch every file. If it did, I would agree with you.
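To give you an idea, here is a rough sketch of what such a tool looks like against the libfam C API. I'm writing this from memory as an illustration, it is not the actual tool, so take the details with a grain of salt:

/* watchdir.c -- minimal sketch of a FAM-based change reporter.
 * Illustrative only; build with something like: cc -o watchdir watchdir.c -lfam
 */
#include <stdio.h>
#include <fam.h>

int main(int argc, char *argv[])
{
    FAMConnection fc;
    FAMRequest fr;
    FAMEvent fe;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    if (FAMOpen(&fc) < 0) {                 /* connect to the famd daemon */
        fprintf(stderr, "FAMOpen failed\n");
        return 1;
    }
    /* Ask FAM to watch the directory; FAM itself decides whether to use
     * dnotify, /dev/imon or polling underneath. */
    if (FAMMonitorDirectory(&fc, argv[1], &fr, NULL) < 0) {
        fprintf(stderr, "FAMMonitorDirectory failed\n");
        return 1;
    }
    for (;;) {
        if (FAMNextEvent(&fc, &fe) < 0)     /* blocks until an event arrives */
            break;
        switch (fe.code) {
        case FAMChanged:
        case FAMCreated:
        case FAMDeleted:
            /* fe.filename is the changed entry, relative to the monitored
             * directory for directory monitors. */
            printf("%s\n", fe.filename);
            fflush(stdout);
            break;
        default:
            break;                          /* ignore FAMExists, FAMEndExist, ... */
        }
    }
    FAMClose(&fc);
    return 0;
}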

One concern I would have using a system indexer is that only system
installed services will have their metadata indexed. For example, it
means that if I install a new Word processor in my account (or on my
local machine, though in some system setups I wouldn't have system
access to my local machine), my word processor documents on the NFS
server (or local machine) will not get indexed with their special
metadata.

As long as the server (NFS or local) has the indexing plugins installed, it will index the documents, whether or not you have the application installed on the client machine. We'd recommend that ISVs write plugins which may draw on their application libraries, but keep the plugins independent of the application itself. You would also be able to see some metadata (document title, topics, etcetera) in the search interface.

We are definitely not planning to make the search service per-user installable; it will be an all-or-nothing proposition. Remember we're aiming at the corporate customer with our enterprise edition and at the general Linux distributions with the open-source release. We will also try to have it enabled by default, which follows naturally from the security and ease-of-use principles embedded in our product. We also plan to be extra careful and have the OSS community audit the code, so we can nail all potential security problems ASAP.

That would help, but as you say will only alleviate the problem.

Evidently. The other cornerstone is what we call "abort on faulty data, no questions asked". Let me explain a bit:

The "root exposure" threat model is limited to three components:
1) The file monitor (indirectly via FAM)
2) The indexing plugins (directly to the files)
3) The indexing service (indirectly via the indexing plugins)
(we plan to make the search service a user-level process, although in principle it will be written with the same security pragmas)

Any of those can be turned into a monster by faulty data, so let's analyze each of them. The file monitor gets its data via FAM, and FAM only passes pathnames to it. Consequently we have no choice but to trust the operating system and the pathnames, because the OS is the source of those pathnames. The only caveat is that we still have to handle every file event. Since minimal processing is involved for each pathname (only matching it against an exclusion list, that's all), the window of exposure is small and the data is mostly trustworthy.
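To make the "minimal processing" point concrete, the check would amount to something like the following. The pattern list and helper name here are made up purely for illustration, not taken from the actual product:

/* Hypothetical sketch of the file monitor's pathname filter. */
#include <fnmatch.h>
#include <stddef.h>

static const char *excluded_patterns[] = {
    "/proc/*", "/sys/*", "/tmp/*", "*.o", "*~", NULL
};

/* Return nonzero if the pathname reported by FAM should be skipped.
 * Flags are 0, so '*' is allowed to match across '/' as well. */
int is_excluded(const char *path)
{
    size_t i;
    for (i = 0; excluded_patterns[i] != NULL; i++) {
        if (fnmatch(excluded_patterns[i], path, 0) == 0)
            return 1;
    }
    return 0;
}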

The indexing plugins are the ones with the largest window of exposure: they will be directly exposed to faulty data. Consequently, each plugin has the duty to reject anything that is not valid data. Manipulating data is easier and less error-prone in high-level languages, which also reduces our window of exposure. We know that, even after all these security precautions are taken, it's possible for a plugin to be compromised, and we take that seriously. But I'm sure that the value provided by our solution is much higher than the value lost to the potential security threat, and that the threat can be effectively controlled (at least for our corporate customers, for whom we might provide managed updates through software distribution channels and the like, though OSS users will also have updates available). If we weren't sure of this, we would simply dive into YAWVBA (yet another Windows/Visual Basic application).
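A hypothetical plugin skeleton, to show what "reject anything that is not valid data" means in practice. The format, magic bytes and names are invented for illustration; the point is "reject first, parse later":

/* Hypothetical entry point of an indexing plugin. */
#include <string.h>
#include <stddef.h>

#define EXAMPLE_MAGIC     "EXDOC\x01"
#define EXAMPLE_MAGIC_LEN 6
#define EXAMPLE_MAX_SIZE  (64 * 1024 * 1024)    /* refuse absurdly large input */

/* Returns 0 on success, -1 if the data is not something we understand. */
int example_plugin_index(const unsigned char *data, size_t len)
{
    if (data == NULL || len < EXAMPLE_MAGIC_LEN || len > EXAMPLE_MAX_SIZE)
        return -1;                              /* abort, no questions asked */
    if (memcmp(data, EXAMPLE_MAGIC, EXAMPLE_MAGIC_LEN) != 0)
        return -1;                              /* not our format */
    /* ... only now start parsing, still validating every field ... */
    return 0;
}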

The indexer would also be written in a high-level language. Since the indexer only needs to parse a standardized XML dialect, the parser can simply throw an exception on faulty data, and the indexer will drop the message immediately. No buffer overflows are possible there. We recognize that defects caused by bad logic could still appear in our product; we will be preemptive and proactive about them.
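As a sketch of the "abort on faulty data" behaviour, here it is with expat standing in as the parser. This is not necessarily what we'll ship, just an illustration of dropping a message the moment the parser complains:

/* Build with -lexpat. Returns 0 if the metadata message parsed cleanly,
 * -1 if it was dropped. */
#include <stdio.h>
#include <expat.h>

int index_metadata_message(const char *buf, int len)
{
    XML_Parser p = XML_ParserCreate(NULL);
    if (p == NULL)
        return -1;
    if (XML_Parse(p, buf, len, 1) == XML_STATUS_ERROR) {
        fprintf(stderr, "dropping message: %s\n",
                XML_ErrorString(XML_GetErrorCode(p)));
        XML_ParserFree(p);
        return -1;                      /* faulty data: abort immediately */
    }
    /* ... element handlers would have fed the index by now ... */
    XML_ParserFree(p);
    return 0;
}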

I doubt text-based XML messages will be a good option for communication
between say a system indexer and specific indexing components. When you
consider the number of document types that could be on a system, this
could represent a major performance bottleneck.

Perhaps if we were to XML-ize full-text-indexing searches, I'd agree. But in practice, data indexing plugins wouldn't pass data as XML; only metadata indexing plugins would. Why? Because metadata is small, so the performance impact is minor. The security, component-discreteness and convenience advantages of parsing XML largely outstrip the performance gain of a CORBA/COM/RPC/stdout-based solution. And indexing speed is secondary; what matters most for the user experience is search speed.

Doing completely live updated "indexed" searches for a whole filesystem
requires a filesystem that supports it. None of the commonly deployed
filesystems (that I know of) currently do. FAM, as I mentioned before,
won't be able to do this.

I showed you that FAM can do this. Trust me, I was amazed, but it can. It apparently doesn't monitor each file; it uses dnotify to receive per-directory events and delivers them as pathnames to the application. On systems which support neither /dev/imon nor dnotify we have a serious problem: FAM would fall back to poll()ing or stat()ing *each* file, which would most certainly kill the system.
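For reference, the dnotify mechanism FAM leans on looks roughly like this (heavily trimmed, no error handling, purely illustrative). dnotify only tells you *which directory* changed via a signal; FAM then works out the affected entries and hands pathnames upward:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t changed_fd = -1;

static void on_change(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    changed_fd = si->si_fd;            /* the directory that changed */
}

int main(int argc, char *argv[])
{
    struct sigaction sa;
    int fd;

    if (argc != 2)
        return 1;

    sa.sa_sigaction = on_change;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGRTMIN, &sa, NULL);

    fd = open(argv[1], O_RDONLY);      /* a directory, not a file */
    fcntl(fd, F_SETSIG, SIGRTMIN);     /* deliver a real-time signal */
    fcntl(fd, F_NOTIFY, DN_MODIFY | DN_CREATE | DN_DELETE | DN_MULTISHOT);

    for (;;) {
        pause();                       /* wait for the next notification */
        if (changed_fd == fd)
            printf("something changed in %s\n", argv[1]);
    }
}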

Not on Unix, at least AFAIK. Disk activity is unfortunately not (to my
knowledge) monitorable, nor does it seem to be adequately handled by
having a low priority. As far as we could tell with medusa, the indexer
would hog the system wrt disk activity no matter how low the priority
was set.

Why? Let me answer my own question. Linux 2.4 allowed applications to hog the disk even when running at low priority. But in Linux 2.6, the disk elevator algorithms, combined with the low-latency work and the anticipatory I/O scheduling, *do* improve system responsiveness under heavy disk throughput and keep a single application from hogging the disk bandwidth. That will make "nice -20 indexd" feasible. I'm also sure that Solaris can cope with this. That leaves the BSDs, and the BSDs are champions at performance as well.

Besides, our view is that this kind of problem is the architecture's fault, and any effort to solve it should be directed at the underlying architecture, not at building a workaround. Systemic thinking: any system can be effectively changed; it's just a matter of finding the leverage point.



