Re: [Tracker] tracker as full text index/search tool for a large collection of pdf, ps, djvu, dvi documents?
- From: Meik Hellmund <Meik Hellmund math uni-leipzig de>
- To: tracker-list gnome org
- Date: Mon, 6 Oct 2008 21:08:10 +0200
On Sun, 05 Oct 2008 23:15:09 +0300
Ivan Frade <ivan frade nokia com> wrote:
Hi, Meik,
On Sat, 04-10-2008 at 17:34 +0200, ext Meik Hellmund wrote:
 - It seems that Postscript, Dvi and Djvu documents are not fully
   indexed, only the metadata are used. How can I change this?
 You need to write a filter that prints the content of those files to
standard output. Check the scripts in /usr/lib/tracker/filters.
 Our convention is:
/usr/lib/tracker/filters/[mimetype]_filter
so application/pdf files are filtered with:
/usr/lib/tracker/filters/application/pdf_filter
 You need to write the filters for application/postscript,
application/x-dvi, application/x-dvi-tar and image/vnd.djvu
 Use the pdf filter as an example; it is very easy to write more.
Great. I now use:
/usr/lib/tracker/filters/application/postscript_filter:
     #!/bin/sh
     nice -n19 ps2txt  "$1" "$2"
/usr/lib/tracker/filters/application/x-dvi_filter:
     #!/bin/sh
     nice -n19 catdvi -e 0   "$1" > "$2"
(after "apt-get install catdvi") 
and it works fine for Postscript and Dvi. 
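
For reference, the naming convention Ivan describes can be sketched as a
small helper that builds the expected filter path for a mime type and
reports whether an executable filter is installed there. The function name
filter_path and the FILTER_DIR variable are my own, not part of tracker;
only the directory and the "[mimetype]_filter" pattern come from this thread.

```shell
#!/bin/sh
# Filter directory as quoted in this thread; FILTER_DIR is a hypothetical
# override so the check can be pointed at another location.
FILTER_DIR=${FILTER_DIR:-/usr/lib/tracker/filters}

# Print the filter path the convention implies for a mime type.
filter_path() {
    printf '%s/%s_filter\n' "$FILTER_DIR" "$1"
}

# Report which of the filters discussed here are present and executable.
for mime in application/postscript application/x-dvi image/vnd.djvu; do
    f=$(filter_path "$mime")
    if [ -x "$f" ]; then
        echo "ok      $f"
    else
        echo "missing $f"
    fi
done
```

This only checks that the files exist; it says nothing about whether
trackerd actually calls them.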
But Djvu is still not working:
 - It seems that Djvu files are classified as "images".
   This may be true in a technical sense, but djvu is a format
   especially adapted to scanned text, and most djvu documents are
   scanned books and similar.
   I think you should reclassify them as "documents".
 In /usr/share/tracker/services/default.services you can see the
mime-types assigned to each category. Try to move the djvu mimetype to
the documents category (and reindex).
I added "image/vnd.djvu" to the "Mimes=.." line in the
[Documents] section of this file, but it didn't help.
  
On Sun, 5 Oct 2008 22:32:19 +0200
"Michael Biebl" <mbiebl gmail com> wrote:
For djvu, there is already a filter
/usr/lib/tracker/filters/text/djvu_filter
It should index the content of djvu files, but it requires the
djvulibre-bin package being installed. (The tracker deb package has a
recommends on this package).
The filter itself works. According to Ivan's explanation about filter names, I also
copied it to  /usr/lib/tracker/filters/image/vnd.djvu_filter
But it isn't used by trackerd. I still get from "trackerd -v 3 -R":
   processing /home/hellmund/PS/no_series_187.djvu with action TRACKER_ACTION_CHECK and counter 0 mime is image/vnd.djvu
   for /home/hellmund/PS/no_series_187.djvu file extension is djvu
   file /home/hellmund/PS/no_series_187.djvu is indexable
   file /home/hellmund/PS/no_series_187.djvu has fulltext 0 with service Images 
   Indexing /home/hellmund/PS/no_series_187.djvu with service Images and mime image/vnd.djvu (new) 
   service id for Images is 6 and sid is 1279 with mime image/vnd.djvu
So it seems to me that it is not fulltext-indexed since it is categorized as an Image.
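
To find all files affected in the same way, the log can be filtered for
lines of the "has fulltext 0" form shown above. This is only a sketch: the
function name no_fulltext is my own, and the pattern assumes the log lines
look exactly like the ones quoted here.

```shell
#!/bin/sh
# Read trackerd log text on standard input and print "Service: path" for
# every file reported with "has fulltext 0". The pattern is modelled on
# the log lines quoted in this mail and may not match other versions.
no_fulltext() {
    sed -n 's/^ *file \(.*\) has fulltext 0 with service \([A-Za-z]*\).*$/\2: \1/p'
}

# Possible use (2>&1 because I have not checked which stream the log uses):
#   trackerd -v 3 -R 2>&1 | no_fulltext
```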
I also did an "strace -f trackerd -R" and found that /usr/share/tracker/services/default.services
is never read by trackerd; only the /usr/share/tracker/services/*.metadata files are opened.
Any ideas?
 - How about compressed files? The documentation mentions that .gz
   files are supported. What about .bz2? Is it possible to add a
filter for other compression methods?
Let me ask this question again. I have a lot of .ps.gz and .ps.bz2 files
and at the moment they are not indexed by tracker. Of course disk space is cheap nowadays and
I could uncompress them all. But what is tracker's expected behaviour? 
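
What I have in mind is a wrapper along the following lines, which picks
the decompressor from the file suffix and would then hand the result to
the existing filter. This is a sketch under assumptions: the function name
decompress is my own, and I do not know whether trackerd would call a
filter for compressed files at all, which is exactly my question.

```shell
#!/bin/sh
# Write the uncompressed contents of "$1" to standard output, choosing
# the tool from the file suffix. Files without a known suffix are passed
# through unchanged.
decompress() {
    case "$1" in
        *.gz)  gzip -dc -- "$1" ;;
        *.bz2) bzip2 -dc -- "$1" ;;
        *)     cat -- "$1" ;;
    esac
}

# Hypothetical use inside a filter called with input "$1" and output "$2":
#   tmp=$(mktemp) && decompress "$1" > "$tmp" &&
#       /usr/lib/tracker/filters/application/postscript_filter "$tmp" "$2"
#   rm -f "$tmp"
```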
Many thanks for your time & answers!
Meik
-- 
Meik Hellmund
Mathematisches Institut, Uni Leipzig
e-mail: Meik Hellmund math uni-leipzig de
http://www.math.uni-leipzig.de/~hellmund