Re: FYI: Some PDF files fail to index 100% due to invalid UTF-8 generated by pdftotext on Fedora Core 6 (FC6)

From: "D Bera" <dbera web gmail com>
To: "Karsten Rasmussen" <frommetoyou comxnet dk>
Cc: dashboard-hackers gnome org
Subject: Re: FYI: Some PDF files fail to index 100% due to invalid UTF-8 generated by pdftotext on Fedora Core 6 (FC6)
Date: Mon, 24 Mar 2008 12:44:18 -0400

> I had problems finding some PDF files with beagle-query.
> I think the problem is that pdftotext sometimes returns invalid UTF-8
> data, probably in documents containing the Danish letters æøåÆØÅ.
>
> Wrapping pdftotext as below seems to work:
>
>     /usr/bin/pdftotext -q -nopgbrk -enc Latin1 "$FILE" - | iconv -t UTF-8 -f iso8859-1

The -enc option is supposed to control the encoding of the text output.
Beagle uses -enc utf8. If running with -enc Latin1 and then passing the
result through iconv to convert it to UTF-8 produces valid UTF-8 text,
then it is definitely a bug in pdftotext: pdftotext -enc utf8 should
have produced correct UTF-8 text itself.
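
To confirm this outside of Beagle, one option is to check the direct
output for validity and only fall back to your Latin1 workaround when
the check fails. Below is a rough sketch, not what Beagle actually does.
It assumes iconv is installed and that the UTF-8 encoding is spelled
"UTF-8" on the pdftotext command line. The names is_valid_utf8 and
pdf_to_utf8 are made up for illustration.

```shell
#!/bin/sh

# Return 0 if stdin is valid UTF-8, non-zero otherwise.
# iconv fails on byte sequences that are illegal in the source
# encoding, so a UTF-8 to UTF-8 pass works as a validator.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Hypothetical helper: print the text of a PDF as UTF-8.
# Tries the direct UTF-8 output first and uses the Latin1 + iconv
# workaround only when the direct output is not valid UTF-8.
pdf_to_utf8() {
    file=$1
    if pdftotext -q -nopgbrk -enc UTF-8 "$file" - | is_valid_utf8; then
        pdftotext -q -nopgbrk -enc UTF-8 "$file" -
    else
        # Assumption: the document only needs Latin1 characters.
        pdftotext -q -nopgbrk -enc Latin1 "$file" - |
            iconv -f ISO-8859-1 -t UTF-8
    fi
}
```

Running "pdftotext -q -enc UTF-8 file.pdf - | is_valid_utf8; echo $?"
on one of the problem files should print a non-zero status if the
output really is broken. Note that the Latin1 fallback cannot represent
characters outside Latin1, so it is a workaround and not a fix.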

You might want to check the xpdf bugzilla to see whether they have any
related bugs open.

Thanks,

- dBera

-- 
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user

References:
- FYI: Some PDF files fail to index 100% due to invalid UTF-8 generated by pdftotext on Fedora Core 6 (FC6)
  - From: Karsten Rasmussen