Re: [Tracker] Extracting the extractors

From: Philip Van Hoof <philip codeminded be>
To: Cosimo Cecchi <cosimoc gnome org>
Cc: Tracker mailing list <tracker-list gnome org>
Subject: Re: [Tracker] Extracting the extractors
Date: Wed, 27 Apr 2016 11:46:16 +0200

Hi guys,

Note to Sam and Cosimo: if this becomes an API used by external users,
it will in time have to follow API rules. These include not only
not breaking API unless incrementing the major version number (which
is something you shouldn't do every other week) but also things like
documentation and maintainership (as a good citizen).

Kind regards,

Philip


On Wed, 2016-04-27 at 09:57 +0530, Cosimo Cecchi wrote:

Hey Sam,


A little late on this thread, but this sounds awesome!
We actually chose JSON-LD too to represent record metadata for our
offline content applications at Endless and I would be really happy if
we could start using the Tracker extractors instead of rolling our own
to extract metadata.
I understand from this thread that it may not be the best format for
Tracker's internals, but having it as an option would definitely be a
useful thing for us.


Thanks for working on this, I look forward to seeing it land!


Cheers,
Cosimo

On Sat, Apr 9, 2016 at 5:09 AM, Sam Thursfield <ssssam gmail com>
wrote:
        Hi all
        
        I've always felt like Tracker's extractors should be reusable outside
        Tracker. The design makes that possible but right now they output their
        results as a series of slightly non-standard SPARQL update commands,
        which I don't think is useful for many folk. Lots of people aren't using
        SPARQL databases at all, believe it or not :-)
        
        The whole point of RDF is to make data interchange easy so I think we
        can do better than that. I've been looking at making the extractors
        optionally output their results in JSON-LD[1] format instead. The cool
        thing about JSON-LD is that if you squint, it's just good old JSON that
        everyone's familiar with. If you look closely it's also Linked Data,
        but in a more human-friendly serialization format than any of the more
        traditional RDF formats.
        
        The catch here is that Tracker's extractor modules are all hardwired to
        generate SPARQL using TrackerSparqlBuilder. To be honest I've never
        liked this approach: it's pretty incomprehensible to newcomers and
        overly verbose, especially where we explicitly generate DELETE queries
        to go along with the INSERT queries.
        
        So, inspired by something in the Python RDFLib library, I came up with a
        TrackerResource class that the extractors can use instead. This is a
        work in progress, but I have a branch in git.gnome.org that adds
        TrackerResource, and converts some of the extractors to use it. The
        TrackerResource class can serialize either to SPARQL update commands or
        to JSON-LD. The branch also adds the `tracker extract` command from
        <https://bugzilla.gnome.org/show_bug.cgi?id=751991> so you can try out
        the extractors easily and specify `-o json` or `-o sparql` as you
        prefer.
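
To make the "one object, two serializations" idea concrete, here is a minimal
Python sketch. It is not the TrackerResource API from the branch (which is C);
the class name `Resource` and its methods `set_value`, `to_jsonld` and
`to_sparql` are hypothetical, the file URI is made up, and only string and
numeric literals are handled.

```python
import json


class Resource:
    """Hypothetical stand-in for the idea behind TrackerResource."""

    def __init__(self, identifier, rdf_type):
        self.identifier = identifier
        self.rdf_type = rdf_type
        self.properties = {}  # predicate -> literal or nested Resource

    def set_value(self, predicate, value):
        self.properties[predicate] = value

    def to_jsonld(self):
        """Return a plain dict; nested resources become nested objects."""
        doc = {"@id": self.identifier, "@type": self.rdf_type}
        for predicate, value in self.properties.items():
            if isinstance(value, Resource):
                doc[predicate] = value.to_jsonld()
            else:
                doc[predicate] = value
        return doc

    def to_sparql(self):
        """Return INSERT blocks, with nested resources emitted first."""
        blocks = []
        lines = ["<%s> a %s" % (self.identifier, self.rdf_type)]
        for predicate, value in self.properties.items():
            if isinstance(value, Resource):
                blocks.append(value.to_sparql())
                obj = "<%s>" % value.identifier
            elif isinstance(value, str):
                obj = json.dumps(value)  # double-quoted, escaped literal
            else:
                obj = str(value)  # numbers only in this sketch
            lines.append("%s %s" % (predicate, obj))
        blocks.append("INSERT {\n  " + " ;\n  ".join(lines) + " .\n}")
        return "\n".join(blocks)


artist = Resource("urn:artist:Best%20Coast", "nmm:Artist")
artist.set_value("nmm:artistName", "Best Coast")

track = Resource("file:///tmp/example.mp3", "nmm:MusicPiece")
track.set_value("nie:title", "The Only Place")
track.set_value("nmm:performer", artist)

print(json.dumps(track.to_jsonld(), indent=2))
print(track.to_sparql())
```

The same in-memory tree feeds both outputs: JSON-LD nests the artist inside the
track, while the SPARQL form flattens it into a separate INSERT and refers to it
by URI.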
        
        The results for the extractors I've converted so far are promising in
        terms of reducing code size:
        
             src/tracker-extract/tracker-extract-abw.c       |  51 ++--
             src/tracker-extract/tracker-extract-bmp.c       |  18 +-
             src/tracker-extract/tracker-extract-dvi.c       |  17 +-
             src/tracker-extract/tracker-extract-epub.c      | 131 +++-----
             src/tracker-extract/tracker-extract-gstreamer.c | 910 ++++++++++++++++++-------------------------------------
             src/tracker-extract/tracker-extract-mp3.c       | 378 ++++++++---------------
             6 files changed, 511 insertions(+), 994 deletions(-)
        
        Here's an example of auto-generated SPARQL for an MP3
        extraction:
        
            DELETE {
            }
            WHERE {
            <file:///home/sam/Downloads/Best%20Coast%20-%20The%20Only%20Place.mp3>
                 nie:comment ?nie_comment ;
                 nmm:trackNumber ?nmm_trackNumber ;
                 nmm:performer ?nmm_performer ;
                 nfo:averageBitrate ?nfo_averageBitrate ;
                 nmm:musicAlbum ?nmm_musicAlbum ;
                 nfo:channels ?nfo_channels ;
                 nmm:dlnaProfile ?nmm_dlnaProfile ;
                 nmm:musicAlbumDisc ?nmm_musicAlbumDisc ;
                 rdf:type ?rdf_type ;
                 nfo:duration ?nfo_duration ;
                 nfo:codec ?nfo_codec ;
                 nmm:dlnaMime ?nmm_dlnaMime ;
                 nfo:sampleRate ?nfo_sampleRate ;
                 nie:title ?nie_title .
            }
            DELETE {
            }
            WHERE {
            <urn:artist:Best%20Coast> nmm:artistName ?nmm_artistName ;
                 rdf:type ?rdf_type .
            }
            INSERT {
            <urn:artist:Best%20Coast> a nmm:Artist ;
                 nmm:artistName "Best Coast" .
            }
            DELETE {
            }
            WHERE {
            <urn:album:The%20Only%20Place> nmm:albumTitle ?nmm_albumTitle ;
                 rdf:type ?rdf_type ;
                 nmm:albumArtist ?nmm_albumArtist .
            }
            INSERT {
            <urn:album:The%20Only%20Place> a nmm:MusicAlbum ;
                 nmm:albumTitle "The Only Place" ;
                 nmm:albumArtist <urn:artist:Best%20Coast> .
            }
            DELETE {
            }
            WHERE {
            <urn:album-disc:%D0:%06%02:Disc1> nmm:setNumber ?nmm_setNumber ;
                 nmm:albumDiscAlbum ?nmm_albumDiscAlbum ;
                 rdf:type ?rdf_type .
            }
            INSERT {
            <urn:album-disc:%D0:%06%02:Disc1> a nmm:MusicAlbumDisc ;
                 nmm:setNumber 1 ;
                 nmm:albumDiscAlbum <urn:album:The%20Only%20Place> .
            }
            INSERT {
            <file:///home/sam/Downloads/Best%20Coast%20-%20The%20Only%20Place.mp3>
                 a nmm:MusicPiece , nfo:Audio ;
                 nie:comment "Free download from http://www.last.fm/music/Best+Coast and http://MP3.com" ;
                 nmm:trackNumber 1 ;
                 nmm:performer <urn:artist:Best%20Coast> ;
                 nfo:averageBitrate 128000 ;
                 nmm:musicAlbum <urn:album:The%20Only%20Place> ;
                 nfo:channels 2 ;
                 nmm:dlnaProfile "MP3" ;
                 nmm:musicAlbumDisc <urn:album-disc:%D0:%06%02:Disc1> ;
                 nfo:duration 164 ;
                 nfo:codec "MPEG" ;
                 nmm:dlnaMime "audio/mpeg" ;
                 nfo:sampleRate 44100 ;
                 nie:title "The Only Place" .
            }
        
        Note there are a lot more DELETE statements than before. I figured that
        anywhere we want to replace the existing data we need a DELETE
        statement, and the reason we don't normally do it is because previously
        it had to be done manually. That said, the TrackerResource class does
        have a way of avoiding this. If you ever call _set_value() for a
        property then it assumes you want to *overwrite* it, and will generate
        a DELETE. If you only use _add_value() then it will assume you want to
        *add* to it, and won't generate a DELETE. The latter case is needed for
        stuff like nao:hasTag. I may be misunderstanding things here of course,
        I didn't actually write any of the extractors myself.
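
The overwrite-versus-add rule described above can be sketched in a few lines of
Python. This is only an illustration under assumptions: the `Resource` class,
its method names and the tag URIs are hypothetical, objects are passed in
already formatted as SPARQL terms, and the sketch emits one standard SPARQL 1.1
`DELETE { ... } WHERE { ... }` per overwritten property rather than reproducing
the exact output shown earlier.

```python
class Resource:
    """Hypothetical sketch: set_value() overwrites, add_value() appends."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.values = {}        # predicate -> list of formatted objects
        self.overwrite = set()  # predicates that set_value() was called on

    def set_value(self, predicate, obj):
        # Replace whatever was collected so far and remember that the
        # stored value must be deleted before the new one is inserted.
        self.values[predicate] = [obj]
        self.overwrite.add(predicate)

    def add_value(self, predicate, obj):
        # Multi-valued property: keep existing values, never delete.
        self.values.setdefault(predicate, []).append(obj)

    def to_sparql(self):
        subject = "<%s>" % self.identifier
        statements = []
        for predicate in sorted(self.overwrite):
            pattern = "%s %s ?old" % (subject, predicate)
            statements.append("DELETE { %s } WHERE { %s }" % (pattern, pattern))
        triples = [
            "%s %s %s" % (subject, predicate, obj)
            for predicate, objs in self.values.items()
            for obj in objs
        ]
        if triples:
            statements.append("INSERT DATA { %s }" % " . ".join(triples))
        # SPARQL Update separates operations in one request with ';'.
        return " ;\n".join(statements)


track = Resource("file:///tmp/example.mp3")
track.set_value("nie:title", '"The Only Place"')
track.add_value("nao:hasTag", "<urn:tag:indie>")
track.add_value("nao:hasTag", "<urn:tag:surf>")
update = track.to_sparql()
print(update)
```

Only nie:title gets a DELETE here; both nao:hasTag values are inserted alongside
whatever tags the store already holds.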
        
        Here's an example of JSON-LD output:
        
        {
          "nie:comment" : "Free download from http://www.last.fm/music/Best+Coast and http://MP3.com",
          "nmm:trackNumber" : 1,
          "nmm:performer" : {
            "@id" : "urn:artist:Best%20Coast",
            "nmm:artistName" : "Best Coast",
            "@type" : "nmm:Artist"
          },
          "nfo:averageBitrate" : 128000,
          "nmm:musicAlbum" : {
            "@id" : "urn:album:The%20Only%20Place",
            "nmm:albumTitle" : "The Only Place",
            "@type" : "nmm:MusicAlbum",
            "nmm:albumArtist" : {
              "@id" : "urn:artist:Best%20Coast",
              "nmm:artistName" : "Best Coast",
              "@type" : "nmm:Artist"
            }
          },
          "nfo:channels" : 2,
          "nmm:dlnaProfile" : "MP3",
          "nmm:musicAlbumDisc" : {
            "@id" : "urn:album-disc:%C0:L%01:Disc1",
            "nmm:setNumber" : 1,
            "nmm:albumDiscAlbum" : {
              "@id" : "urn:album:The%20Only%20Place",
              "nmm:albumTitle" : "The Only Place",
              "@type" : "nmm:MusicAlbum",
              "nmm:albumArtist" : {
                "@id" : "urn:artist:Best%20Coast",
                "nmm:artistName" : "Best Coast",
                "@type" : "nmm:Artist"
              }
            },
            "@type" : "nmm:MusicAlbumDisc"
          },
          "nfo:duration" : 164,
          "nfo:codec" : "MPEG",
          "nmm:dlnaMime" : "audio/mpeg",
          "nfo:sampleRate" : 44100,
          "nie:title" : "The Only Place"
        }
        
        We can actually do much better than this: right now there's no
        @context, so it kind of misses the point of JSON-LD. I need to
        finish writing a NamespaceManager class that can track all of the
        prefixes and generate a suitable JSON-LD context, so that instead
        of stuff like "nie:title", it can just say "title" and then the
        @context will link that to
        <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#title>.
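
As a rough illustration of what generating a context could look like, the
Python sketch below rewrites prefixed keys such as "nie:title" to bare terms
and records each term's full IRI in an @context. It is not the NamespaceManager
from the branch; the function name `compact` and the `PREFIXES` table are
assumptions, and the prefix definitions are copied into the context so that
prefixed values (for example in @type) still resolve.

```python
import json

# Assumed prefix table; a real implementation would track every ontology
# prefix that the extractors use.
PREFIXES = {
    "nie": "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#",
    "nfo": "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#",
}


def compact(doc, prefixes):
    """Return a copy of doc with short keys and a matching @context."""
    context = dict(prefixes)

    def walk(node):
        if isinstance(node, list):
            return [walk(item) for item in node]
        if not isinstance(node, dict):
            return node
        out = {}
        for key, value in node.items():
            prefix, sep, term = key.partition(":")
            if sep and prefix in prefixes:
                iri = prefixes[prefix] + term
                # Refuse to map one short term to two different IRIs.
                if context.setdefault(term, iri) != iri:
                    raise ValueError("term %r is ambiguous" % term)
                key = term
            out[key] = walk(value)
        return out

    body = walk(doc)
    return {"@context": context, **body}


doc = {
    "@id": "file:///tmp/example.mp3",
    "nie:title": "The Only Place",
    "nfo:channels": 2,
}
result = compact(doc, PREFIXES)
print(json.dumps(result, indent=2))
```

Keywords such as "@id" contain no colon and pass through untouched, and a clash
between two prefixes that define the same short term is reported instead of
being silently resolved.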
        
        The code is in branch wip/sam/resource:
        <https://git.gnome.org/browse/tracker/log/?h=wip/sam/resource>.
        
        It's still of course a work in progress but I think it's pretty much
        taken shape, so please have a look and give feedback on whether you
        think this is a sane approach!
        
        Thanks
        Sam
        
        [1]: http://json-ld.org/
        _______________________________________________
        tracker-list mailing list
        tracker-list gnome org
        https://mail.gnome.org/mailman/listinfo/tracker-list



