Re: Two questions about medusa



>>>>> Curtis Hovey writes:

 > On Mon, 2003-05-05 at 11:35, David C Sterratt wrote:
 >> I've had more of a look at medusa, and I've got two questions:
 >>
 >> Firstly, it looks as though to add a content indexer for
 >> application/pdf would require writing some c-code to make a PDF
 >> indexing module.  Is that right?  Other indexers (e.g. htdig IIRC)
 >> allow plugins specified in a configuration file to convert between
 >> mime types (e.g. pdftotext for application/pdf to text/plain).  Is
 >> this planned for medusa?

 > Yes.  My priorities are:
 > 1. Upgrade the progress dialog to gtk2 in nautilus
 > 2. Restore keyword/emblem indexing.
 > 3. Distribute patches and notes to get back advice, criticism, and
 > flames.
 > 4. OpenOffice indexer
 > 5. MSOffice indexer
 > 6. PS/PDF indexer

 > I believe 1, 2, 3 will be done in the next 7 days (I got a lot done
 > in the last week).  The indexers will be added in the subsequent
 > weeks.  It only takes a few days to do an indexer.  Word on the
 > street has it, that I'll be laid off from TimeLife.com in the next
 > month so I'll finally have time to work on something interesting,
 > barring the fact I've got to look for a job.

Excellent.  (That you'll be doing some work on medusa, not losing your
job.)

 > I've toyed with the plugin idea as it might get some things done
 > quickly.  I'd like to bring some intelligence to what is indexed,
 > and the plain text indexer cannot handle that.  XML content like
 > OpenOffice is very rich and it would lose some of its meaning
 > and relevance if it were crudely converted to plain text.  PDFs
 > don't have any meaning. They would be fine in your solution.  We
 > need to weigh the capability of adding ad hoc indexers versus their
 > potential dependencies.
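
To make the plugin idea concrete, here is a rough sketch (in Python for
brevity, not medusa's C) of a converter table keyed by MIME type.  The
table layout and the function names are my own invention, not anything
in medusa; pdftotext and ps2ascii are the real external tools I had in
mind.  A missing tool just means the file is skipped, which is one way
of keeping the dependencies optional.

```python
import shutil
import subprocess

# Hypothetical converter table: source MIME type -> (target type, argv template).
# In a real setup this would be read from a configuration file.
CONVERTERS = {
    "application/pdf": ("text/plain", ["pdftotext", "{path}", "-"]),
    "application/postscript": ("text/plain", ["ps2ascii", "{path}"]),
}


def build_command(mime_type, path, converters=CONVERTERS):
    """Return (target_type, argv) for mime_type, or None if none is configured."""
    entry = converters.get(mime_type)
    if entry is None:
        return None
    target, template = entry
    return target, [arg.replace("{path}", path) for arg in template]


def extract_text(mime_type, path, converters=CONVERTERS):
    """Run the configured converter and return its output, or None on any failure."""
    command = build_command(mime_type, path, converters)
    if command is None:
        return None
    argv = command[1]
    if shutil.which(argv[0]) is None:
        # Converter not installed: treat the file as unindexable, not as an error.
        return None
    result = subprocess.run(argv, capture_output=True, text=True,
                            errors="replace", check=False)
    return result.stdout if result.returncode == 0 else None
```

The text that comes back could then be handed to the existing plain
text indexer unchanged.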

How would one use the rich semantics in openoffice files as search
terms?  At the moment, the searching semantics only allow for
"contains any or all of".  I suppose you could extend them to things
like "author matches", but that might be confusing if some of the
other documents you're searching don't have the rich semantic
information, since you wouldn't be able to retrieve (say) PDF files
written by a particular author, but you would get OO files written by
them.  
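
One way round that confusion might be to report files with no author
information separately rather than silently dropping them.  A minimal
sketch, assuming each indexed document is a dict with a "text" entry
and an optional "author" entry (that layout and the search() function
are hypothetical, not medusa's):

```python
def search(docs, words=(), match_all=True, author=None):
    """Return (hits, unknown).

    hits:    documents matching the words and, if given, the author.
    unknown: documents matching the words but carrying no author field,
             so the author criterion could not be checked for them.
    """
    hits, unknown = [], []
    wanted = {w.lower() for w in words}
    for name, doc in docs.items():
        present = set(doc["text"].lower().split())
        if wanted:
            # "contains all of" versus "contains any of"
            ok = wanted <= present if match_all else bool(wanted & present)
            if not ok:
                continue
        if author is None:
            hits.append(name)
        elif "author" not in doc:
            unknown.append(name)
        elif author.lower() in doc["author"].lower():
            hits.append(name)
    return hits, unknown


docs = {
    "report.sxw": {"text": "medusa index design", "author": "Curtis Hovey"},
    "paper.pdf": {"text": "medusa index notes"},  # no metadata extracted
    "todo.txt": {"text": "buy milk"},
}
hits, unknown = search(docs, ["medusa"], author="hovey")
```

Here report.sxw ends up in hits and paper.pdf in unknown, so the user
can at least see that the PDF was not ruled out by its author.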

 >> Secondly, it looks as though medusa can't search for phrases or
 >> words including globbing characters.  Is that right?

 > Yeah.  That is a weakness, and a difficult one to overcome. I can
 > imagine how to add the phrase capabilities by adding some additional
 > index information.  The globbing (* and ?) could be done with some
 > ungraceful hacks--but I think we would need to get the OR
 > functionality working.

Would it be possible to combine another, more sophisticated full text
indexer library with the medusa code for the filesystem properties of
files?  Or, if a library can't be used, at least some of its code?  I've
found at least one indexer that does wildcard and phrase searching
(Swish-e), but it can't do incremental reindexing.
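
For what it's worth, here is my guess at what the "additional index
information" for phrases could look like: word positions stored
alongside each posting.  Globbing then becomes an OR over every
vocabulary word that matches the pattern.  This is only a sketch under
those assumptions; the function names are mine, and it says nothing
about how medusa or Swish-e actually store their indexes.

```python
import fnmatch
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_name: [positions of the word in that doc]}."""
    index = defaultdict(dict)
    for name, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(name, []).append(pos)
    return index


def positions(index, pattern):
    """OR together the postings of every indexed word matching the glob pattern."""
    merged = defaultdict(set)
    for word in list(index):
        if fnmatch.fnmatchcase(word, pattern.lower()):
            for name, plist in index[word].items():
                merged[name].update(plist)
    return merged


def phrase_search(index, phrase):
    """Return sorted names of documents containing the terms at consecutive positions."""
    terms = phrase.split()
    if not terms:
        return []
    postings = [positions(index, term) for term in terms]
    hits = []
    for name, starts in postings[0].items():
        if any(all(start + i in postings[i].get(name, ())
                   for i in range(1, len(terms)))
               for start in starts):
            hits.append(name)
    return sorted(hits)


docs = {
    "a.txt": "full text indexer library",
    "b.txt": "text of the full index",
    "c.txt": "indexing full text files",
}
index = build_index(docs)
```

With this, phrase_search(index, "full text") finds a.txt and c.txt but
not b.txt, and a single-term query such as "index*" is just a
one-word phrase.  The cost is a bigger index and a scan of the
vocabulary for every wildcard term.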

David.




