[Tracker] Reviving the libstreamanalyzer based extractor module

From: Philip Van Hoof <philip codeminded be>
To: Tracker mailing list <tracker-list gnome org>
Subject: [Tracker] Reviving the libstreamanalyzer based extractor module
Date: Fri, 14 Jun 2013 16:23:07 +0200

Hi team,

During a Tracker/Nepomuk/SPARQL training I gave at one of my customers, I noted interest in extractors that can dive into archives and document types that contain a tree of other documents (like MIME documents).

Right now a Tracker extractor module can't extract an MP3 that is attached to an E-mail located in a Maildir or stored in a tar.gz file. My recommendation for E-mail client authors is, and will probably always be, to store MIME parts, whether their disposition is set to attachment or not, Base64 decoded on the filesystem. That means that the cache of an E-mail client stores an MP3 attachment of an E-mail ... as an MP3, and not as a blob of Base64 encoded "plain text" (which is stupid and not useful). That way, writing a miner for such an E-mail client would mean configuring the FS miner to just index the cache directory (and perhaps tweaking the nie:url value and adding nmo rdf:type qualifiers).
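
To make the recommendation concrete, here is a minimal sketch of what an E-mail client's cache writer could do, using Python's standard email package. The function name cache_decoded_parts, the cache directory layout and the "part-N" fallback name for unnamed parts are assumptions made for illustration; they are not part of Tracker or of any particular E-mail client.

```python
import email
import os
from email import policy


def cache_decoded_parts(raw_message: bytes, cache_dir: str) -> list:
    """Write every leaf MIME part to cache_dir in its decoded form.

    Hypothetical helper: a real client would also record which message
    each cached file belongs to.
    """
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    os.makedirs(cache_dir, exist_ok=True)
    written = []
    for index, part in enumerate(msg.walk()):
        if part.is_multipart():
            continue  # containers hold no payload of their own
        # basename() drops any directory components from the declared name;
        # unnamed parts get an illustrative "part-N" name.
        name = os.path.basename(part.get_filename() or f"part-{index}")
        payload = part.get_payload(decode=True)  # undoes Base64 / quoted-printable
        path = os.path.join(cache_dir, name)
        with open(path, "wb") as out:
            out.write(payload or b"")
        written.append(path)
    return written
```

With a cache like this, the FS miner only sees ordinary files (an MP3 is an MP3), so the existing file based extractors apply unchanged.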

That, or libtracker-extract should allow stream or buffer based extraction, and/or file descriptor based extraction (in which case we could pass the extractor modules, the ones now only used by tracker-extract, a pipe-created FD from the E-mail client, and write the Base64 decoded data to the pipe FD - or something). Unfortunately, tracker-extract is right now entirely FILE based (not really FD based, nor stream based).
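
The pipe idea can be sketched roughly as follows. This is not an existing libtracker-extract API: extract_from_fd stands in for a hypothetical FD based extractor entry point, and the "ID3" check is only a toy stand-in for real metadata extraction. The point is that the reader only ever reads sequentially from the descriptor and never seeks.

```python
import base64
import os
import threading


def extract_from_fd(fd: int) -> dict:
    """Hypothetical FD based extractor: sequential reads only, no seek."""
    size = 0
    head = b""
    with os.fdopen(fd, "rb") as stream:
        while chunk := stream.read(4096):
            if len(head) < 3:
                head += chunk[: 3 - len(head)]  # remember the first 3 bytes
            size += len(chunk)
    return {"looks_like_id3": head == b"ID3", "size": size}


def feed_decoded(b64_text: bytes, write_fd: int) -> None:
    """E-mail client side: decode the Base64 part and write it to the pipe."""
    with os.fdopen(write_fd, "wb") as sink:
        sink.write(base64.b64decode(b64_text))
    # leaving the with-block closes the write end, so the reader sees EOF


def extract_base64_part(b64_text: bytes) -> dict:
    """Connect the two sides with a pipe; the writer runs in its own thread."""
    read_fd, write_fd = os.pipe()
    writer = threading.Thread(target=feed_decoded, args=(b64_text, write_fd))
    writer.start()
    try:
        return extract_from_fd(read_fd)
    finally:
        writer.join()
```

The writer runs in a separate thread because a pipe has a limited buffer; writing a large attachment from the same thread that later reads it would block.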

Also, some use-cases of Tracker's FS miner want files inside archives to be indexed.

Tracker's native extractors can't do any of this. That's because they are open/seek/read/close based and not stream based. The libstreamanalyzer library aims to implement extraction of metadata in a stream based way, with support for diving into archives and MIME documents.
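
To illustrate what "stream based" buys you (this is not libstreamanalyzer's API, just Python's standard tarfile module used as an analogy): in the "r|gz" mode tarfile treats its input as a non-seekable stream, and each member can be handed on as a stream of its own. The function name walk_archive_stream is an assumption, and reading the first 16 bytes stands in for a nested analyzer.

```python
import io
import tarfile


def walk_archive_stream(stream):
    """Yield (member name, first 16 bytes) for each regular file in a tar.gz.

    The input is only read front to back, so it could just as well be a
    pipe or a socket instead of a file on disk.
    """
    # "r|gz": the pipe character selects tarfile's stream (non-seekable) mode
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, devices
            data = archive.extractfile(member)
            # A nested analyzer would consume `data` here; we only peek.
            yield member.name, data.read(16)
```

In stream mode the members have to be consumed in the order they appear, which is exactly the constraint an open/seek/read/close based extractor cannot live with.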

It was for that reason that I once wrote tracker-topanalyzer.cpp in the src/tracker-extract directory. It's unmaintained nowadays.

I think it would be a great first addition if the tracker-extract .rule file based environment were adapted to have two levels of matching: first on container and then on MIME type. The first level would be "Just File" for all of its native extractors, and "MIMEDocument" and "Archive" for libstreamanalyzer's. The second level would be the same as now. Ideally this level system could also be used for multimedia files (videos have first a MIME type and then a codec type, for example).
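
A rough sketch of how such two-level matching could behave. This does not reproduce the actual .rule file format; the Rule fields, the module names and the glob-style MIME patterns are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    module: str     # extractor module to load (illustrative names below)
    container: str  # first level: "Just File", "Archive" or "MIMEDocument"
    mime_glob: str  # second level: MIME type pattern


# Hypothetical rule table: native extractors match "Just File",
# the libstreamanalyzer based module matches the container types.
RULES = (
    Rule("extract-mp3", "Just File", "audio/mpeg"),
    Rule("extract-text", "Just File", "text/*"),
    Rule("topanalyzer", "Archive", "*"),
    Rule("topanalyzer", "MIMEDocument", "*"),
)


def pick_module(container: str, mime_type: str,
                rules: Sequence[Rule] = RULES) -> Optional[str]:
    """Return the first module whose container and MIME pattern both match."""
    for rule in rules:
        if rule.container != container:
            continue  # first level: container kind must match exactly
        if fnmatchcase(mime_type, rule.mime_glob):
            return rule.module  # second level: MIME type pattern
    return None
```

For example, pick_module("Archive", "application/gzip") selects the libstreamanalyzer based module here, while pick_module("Just File", "audio/mpeg") still goes to the native MP3 extractor.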

Then it would become possible for an extractor module like tracker-topanalyzer.cpp to get kicked into action for diving into archive files and MIME documents (and the native ones would still operate on native file types).

tracker-topanalyzer.cpp should also be fixed. It has been a long time since it was last tested and I don't expect it to still work. For it to work, libstreamanalyzer would probably need to be adapted to follow Tracker's Nepomuk adaptations (right now libstreamanalyzer doesn't know about the nmm ontology, afaik).



Kind regards,

Philip

Follow-Ups:
- Re: [Tracker] Reviving the libstreamanalyzer based extractor module
  - From: Ivan Frade
