Re: Will Beagle index PDFs?



You should be able to just re-use your existing HTML parser anyhow.

That's how I pulled the metadata in for my indexer.
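
To illustrate the idea, here is a minimal sketch in Python (not Beagle's
actual filter API). The MetaExtractor class, the extract() helper and the
sample HTML are all made up for this example; the sample is not real
converter output. It just shows that an ordinary HTML parser can collect
<meta> name/content pairs and the text in one pass.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <meta name=... content=...> pairs and the visible text."""

    def __init__(self):
        super().__init__()
        self.meta = {}   # metadata name -> content
        self.text = []   # non-empty text chunks, in document order

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also arrive here by default.
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name] = content

    def handle_data(self, data):
        chunk = data.strip()
        if chunk:
            self.text.append(chunk)


def extract(html):
    """Hypothetical helper: return (metadata dict, list of text chunks)."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.meta, parser.text


# Illustrative input only, not actual converter output.
sample = (
    '<html><head><meta name="Author" content="Jane Doe"/></head>'
    "<body><pre>Hello world</pre></body></html>"
)
meta, text = extract(sample)
```

Anything in a <title> element would also end up in the text list here; a
real filter would probably want to treat that as metadata instead.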

There are also a number of other external converters that generate HTML,
so why not make an ExternalConverterViaHTml abstract base class? Each
specific sub-class would then typically only need to override the actual
external converter command.

Then you can use:
pdftotext with the -htmlmeta flag
rtf2html (http://www.w3.org/Tools/HTMLGeneration/rtf2html.html)
xlhtml  (http://chicago.sourceforge.net/xlhtml/)
ppthtml (distributed with the above)
wvhtml  (ships as part of wvware, http://wvware.sourceforge.net/wvWare.html)

These would give you PDF, RTF, Excel, PowerPoint and Word indexing
respectively.
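
A rough sketch of the base class idea, again in Python rather than
Beagle's own code. The method names command() and to_html() are my
assumptions, as is the premise that each converter can write its HTML to
stdout. Only the pdftotext invocation is filled in (a trailing "-" asks
pdftotext to write to stdout); the command lines for the other converters
would need checking against their documentation.

```python
import subprocess
from abc import ABC, abstractmethod


class ExternalConverterViaHTml(ABC):
    """Run an external converter and hand back the HTML it produces."""

    @abstractmethod
    def command(self, path):
        """Return the argv list that converts `path` to HTML on stdout."""

    def to_html(self, path):
        # check=True raises CalledProcessError if the converter fails.
        result = subprocess.run(
            self.command(path), capture_output=True, check=True
        )
        # Assumption: output is close enough to UTF-8 to decode leniently.
        return result.stdout.decode("utf-8", errors="replace")


class PdfConverter(ExternalConverterViaHTml):
    """The only thing a sub-class supplies is the converter command."""

    def command(self, path):
        return ["pdftotext", "-htmlmeta", path, "-"]
```

The HTML returned by to_html() would then go through the same HTML parser
that is already used for ordinary web pages, so each new format costs one
small sub-class.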

Julian

On Tue, 2004-07-27 at 10:33 -0500, Jon Trowbridge wrote:
> On Tue, 2004-07-27 at 13:31 +0100, Christopher Orr wrote:
> > I'm not sure if it's doing things the right way within the context 
> > of the Beagle framework, but nevertheless it does work.
> 
> Yes, it is doing things the right way. :)  I've committed your patch to
> CVS.
> 
> It is too bad that pdftotext doesn't provide a straightforward way to
> get at the metadata.  Maybe we should be parsing the output of
> 'pdftotext -htmlmeta' instead --- it puts the metadata in <meta> tags,
> and the HTML it generates is so simplistic that we should be able to
> strip it out without too many problems.
> 
> Thanks,
> -J
> 
> 
> 
> 



