Re: FYI: Some pdf files fail to index 100% due to invalid utf8 generated by pdftotext on fedora core 6 (FC6)



> I had problems finding some pdf files with beagle-query.
> I think the problem is that pdftotext sometimes returns invalid utf8 data,
> probably in some documents with the Danish letters æøåÆØÅ.
>
> Wrapping pdftotext as below seems to work:
>
>     /usr/bin/pdftotext -q -nopgbrk -enc Latin1 "$FILE" - | iconv -t UTF-8 -f iso8859-1

The -enc option is supposed to control the encoding of the text output,
and Beagle calls pdftotext with -enc utf8. If running with -enc Latin1
and then converting the result with iconv gives valid utf8 text, while
-enc utf8 does not, then it is definitely a bug in pdftotext:
pdftotext -enc utf8 should have produced correct utf8 text by itself.
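
To confirm which PDFs are affected, something along these lines might help. It is only a rough sketch, not what Beagle does: the function names is_valid_utf8 and pdf_to_utf8 are made up for illustration, the encoding name is spelled UTF-8 as in the pdftotext man page, and it assumes an iconv (such as the glibc one) that exits non-zero on an invalid input sequence.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Beagle or xpdf.

# Succeeds (exit 0) only if stdin is valid UTF-8.
# Assumes iconv stops with a non-zero status on invalid input.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Print the text of a PDF as UTF-8 on stdout.
# Tries the direct UTF-8 output first; if that output does not validate,
# falls back to the Latin1 + iconv workaround from the original report.
pdf_to_utf8() {
    file=$1
    tmp=$(mktemp) || return 1
    if pdftotext -q -nopgbrk -enc UTF-8 "$file" "$tmp" && is_valid_utf8 <"$tmp"; then
        cat "$tmp"
    else
        pdftotext -q -nopgbrk -enc Latin1 "$file" - | iconv -f ISO-8859-1 -t UTF-8
    fi
    status=$?
    rm -f "$tmp"
    return $status
}
```

Sourcing this and running pdf_to_utf8 on a problem file, or piping the plain -enc UTF-8 output through is_valid_utf8, should show whether the direct output is really invalid. Note that the Latin1 fallback cannot represent characters outside ISO-8859-1, so it is a workaround rather than a fix.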

You might want to check the xpdf bugzilla to see whether a related bug
has already been opened.

Thanks,

- dBera

-- 
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user

