Re: [Tracker] FTS4 branch review



On 05/02/13 17:07, Martyn Russell wrote:
Hello all,

So Carlos recently finished the fts4 branch for review. For those who
don't know what this is, there is a nice blog post from Carlos here:

http://blogs.gnome.org/carlosg/2013/01/28/snippets-in-trackers-full-text-search-results/

So one of the things that we've done with the FTS4 branch is to remove the tracker:fulltextNoLimit property. To counter this, we've also started indexing ALL content, not just those words >= min-word-length (which is a configuration option defaulting to 3 characters).

We've done this because we now use the upstream fts4 module and at the level we tokenise the data, we can't check the property configuration to know if we should be indexing ALL words or just words of a certain length.
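
As a rough illustration of what changes for the index, here is a minimal Python sketch. It is not Tracker's or SQLite's actual tokeniser: the `tokenize` function and its `min_word_length` parameter are hypothetical names, and splitting on `\w+` is an assumption that only approximates real word breaking. It simply shows which words a length filter would drop and which are kept once everything is indexed.

```python
import re

def tokenize(text, min_word_length=0):
    """Split text into lowercase words.

    Hypothetical helper: keeps only words with at least
    min_word_length characters (0 means keep everything).
    """
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if len(w) >= min_word_length]

text = "Go to the FTS4 branch"

# Old behaviour (assumed): only words >= min-word-length (default 3).
print(tokenize(text, min_word_length=3))

# FTS4 branch behaviour (assumed): every word is indexed.
print(tokenize(text))
```

With the filter set to 3, "go" and "to" are dropped; without it, all five words end up in the index.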

I've recently done some quick analysis of FTS4 vs 0.14.5 to make sure that we're not causing serious performance or query regressions with this, and here is what I have found...

So the data set:

        Tags            30
        Contacts        329
        Audios          8243
        Documents       73
        Files           10188
        Folders         956
        Images          669
        Applications    285
        Videos          499
        Albums          1139
        Music Tracks    7744
        Photos          433

The tracker-stats output for those interested in the details is below, but NOTE, the stats might be slightly different due to extraction failures I noticed and mention later on. These are the stats for the FTS4 work.

$ tracker-stats
Statistics:
  mfo:Action = 1
  mlo:LandmarkCategory = 15
  mto:State = 6
  mto:TransferMethod = 2
  mtp:ScanType = 6
  nao:Tag = 30
  nco:AuthorizationStatus = 3
  nco:Contact = 329
  nco:Gender = 3
  nco:IMCapability = 8
  nco:PersonContact = 1
  nco:PresenceStatus = 9
  nco:Role = 2354
  nfo:Audio = 8294
  nfo:DataContainer = 1120
  nfo:Document = 73
  nfo:Equipment = 3
  nfo:Executable = 285
  nfo:FileDataObject = 10188
  nfo:Folder = 956
  nfo:Image = 667
  nfo:Media = 8961
  nfo:MediaList = 1134
  nfo:Orientation = 8
  nfo:PaginatedTextDocument = 37
  nfo:PlainTextDocument = 36
  nfo:RegionOfInterestContent = 5
  nfo:Software = 285
  nfo:SoftwareApplication = 285
  nfo:SoftwareCategory = 164
  nfo:TextDocument = 73
  nfo:Video = 499
  nfo:Visual = 1166
  nie:DataObject = 10188
  nie:DataSource = 4
  nie:InformationElement = 15295
  nmm:Artist = 2025
  nmm:Flash = 2
  nmm:MeteringMode = 7
  nmm:MusicAlbum = 1134
  nmm:MusicAlbumDisc = 1265
  nmm:MusicPiece = 7795
  nmm:Photo = 431
  nmm:RadioModulation = 2
  nmm:Video = 499
  nmm:WhiteBalance = 2
  nmo:DeliveryStatus = 5
  nmo:PhoneMessageFolder = 5
  nmo:ReportReadStatus = 3
  nrl:InverseFunctionalProperty = 3
  rdf:Property = 629
  rdfs:Class = 233
  rdfs:Resource = 16321
  scal:AccessLevel = 3
  scal:AttendanceStatus = 7
  scal:AttendeeRole = 4
  scal:CalendarUserType = 5
  scal:EventStatus = 3
  scal:JournalStatus = 4
  scal:RSVPValues = 2
  scal:TodoStatus = 4
  scal:TransparencyValues = 2
  slo:LandmarkCategory = 15
  tracker:Namespace = 23
  tracker:Ontology = 20
  tracker:Volume = 3

The tests I did include:

a) Testing tracker-search with "foo", "love" and "martyn" to make sure we get the same results with FTS queries.

b) Comparing the DB sizes to make sure we're not inflating our stored data with the new FTS changes.

c) Comparing indexing time.

--

Test A (FTS4)
=============

$ tracker-search foo
Results:
  file:///home/martyn/Documents/Important/%23foo.gpg%23
  file:///home/martyn/Documents/tracker-tests-fts4
  file:///home/martyn/Remotes/GrapeVine/Music/Santana/Shaman/Disc%201%20-%206%20-%20Foo%20Foo.mp3

$ tracker-search love|wc -l
492

$ tracker-search martyn|wc -l
32


Test A (0.14.5)
===============

EXACTLY the same.


Test B (FTS4)
=============

$ ls -lh ~/.local/share/tracker/data/ ~/.cache/tracker/
/home/martyn/.cache/tracker/:
total 27M
-rw-rw-r-- 1 martyn martyn   11 Feb 14 18:23 db-locale.txt
-rw-rw-r-- 1 martyn martyn    2 Feb 14 18:23 db-version.txt
-rw-rw-r-- 1 martyn martyn    6 Feb 14 18:33 first-index.txt
-rw-rw-r-- 1 martyn martyn   10 Feb 14 18:33 last-crawl.txt
-rw-r--r-- 1 martyn martyn  25M Feb 14 18:33 meta.db
-rw-r--r-- 1 martyn martyn  32K Feb 14 18:34 meta.db-shm
-rw-r--r-- 1 martyn martyn 1.5M Feb 14 18:34 meta.db-wal
-rw-rw-r-- 1 martyn martyn   11 Dec 24 10:24 miner-applications-locale.txt
-rw-rw-r-- 1 martyn martyn 344K Feb 14 18:23 ontologies.gvdb

/home/martyn/.local/share/tracker/data/:
total 16M
-rw-rw---- 1 martyn martyn 9.6M Feb 14 18:34 tracker-store.journal
-rw-rw---- 1 martyn martyn 5.6M Feb 14 18:23 tracker-store.ontology.journal


Test B (0.14.5)
===============

$ ls -lh ~/.local/share/tracker/data/ ~/.cache/tracker/
/home/martyn/.cache/tracker/:
total 34M
-rw-rw-r-- 1 martyn martyn   11 Feb 14 18:40 db-locale.txt
-rw-rw-r-- 1 martyn martyn    2 Feb 14 18:40 db-version.txt
-rw-rw-r-- 1 martyn martyn    6 Feb 14 18:49 first-index.txt
-rw-rw-r-- 1 martyn martyn   10 Feb 14 18:49 last-crawl.txt
-rw-r--r-- 1 martyn martyn  24M Feb 14 18:49 meta.db
-rw-r--r-- 1 martyn martyn  96K Feb 14 18:53 meta.db-shm
-rw-r--r-- 1 martyn martyn 9.8M Feb 14 18:53 meta.db-wal
-rw-rw-r-- 1 martyn martyn   11 Dec 24 10:24 miner-applications-locale.txt
-rw-rw-r-- 1 martyn martyn 344K Feb 14 18:40 ontologies.gvdb

/home/martyn/.local/share/tracker/data/:
total 16M
-rw-rw---- 1 martyn martyn 9.6M Feb 14 18:53 tracker-store.journal
-rw-rw---- 1 martyn martyn 5.7M Feb 14 18:40 tracker-store.ontology.journal


Test C (FTS4)
=============

Tracker-INFO: --------------------------------------------------
Tracker-INFO: Total directories : 1061 (107 ignored)
Tracker-INFO: Total files       : 8997 (148 ignored)
Tracker-INFO: Total processed   : 9804 (9804 notified, 0 with error)
Tracker-INFO: --------------------------------------------------

Tracker-INFO: Idle
Tracker-INFO: Finished mining in seconds:569.902854, total directories:1061, total files:8997


Test C (0.14.5)
===============

Tracker-INFO: --------------------------------------------------
Tracker-INFO: Total directories : 1061 (107 ignored)
Tracker-INFO: Total files       : 8997 (148 ignored)
Tracker-INFO: Total processed   : 9803 (9803 notified, 58 with error)
Tracker-INFO: --------------------------------------------------

Tracker-INFO: Idle
Tracker-INFO: Finished mining in seconds:538.760078, total directories:1061, total files:8997


Conclusions:
============

For Test A, we can see nothing has changed with our simple tests. So the data set seems intact for FTS searches.

For Test B, the on-disk size for Tracker with FTS4 is smaller: ~/.cache/tracker/ totals 27M vs 34M. Note that meta.db itself is about the same size (25M vs 24M) and most of the difference is in meta.db-wal (1.5M vs 9.8M). So while we might be indexing more words (i.e. those which are shorter than 3 characters), we still end up smaller on disk. The reason for this could be that we were previously duplicating data (Carlos can confirm this) and now we're storing the data only once. Either way, a smaller database is always preferred if we can have it.
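
For anyone who wants to repeat the size comparison file by file rather than eyeballing `ls -lh`, here is a minimal Python sketch. The function names `file_sizes` and `compare_sizes` are hypothetical, and it assumes you have kept a copy of the cache directory from each run to compare against.

```python
import os

def file_sizes(directory):
    """Return {filename: size in bytes} for regular files directly in directory."""
    sizes = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes

def compare_sizes(before, after):
    """Return {filename: after - before} in bytes for files in either mapping."""
    names = sorted(set(before) | set(after))
    return {name: after.get(name, 0) - before.get(name, 0) for name in names}

# Example usage (paths are assumptions, adjust to where the copies live):
#   old = file_sizes("/tmp/tracker-cache-0.14.5")
#   new = file_sizes("/tmp/tracker-cache-fts4")
#   for name, delta in compare_sizes(old, new).items():
#       print(f"{name:35s} {delta:+d} bytes")
```

Reporting per file keeps the main database and its write-ahead log separate, so a change in one is not hidden by the directory total.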

For Test C, this might not be an accurate portrayal of the situation. First, you may notice we had errors with 0.14.5, which means 58 items were not indexed. That will definitely affect the time to finish indexing. Second, ALL the music (which accounts for the majority of the data indexed here) was being indexed over an encfs-mounted directory on a server (with a GB connection) on my local network. I was also playing music from the same server at the time, and that will affect the bandwidth available too. So I am not convinced the speed test was entirely fair. However, if you work out an approximate time per item processed (total mining time divided by total processed), it's ca. 0.058 secs (FTS4) vs 0.055 secs (0.14.5). There isn't much in it. So performance-wise, I don't think we're noticeably worse than we were.
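
The per-item figures come from dividing the "Finished mining in seconds" value by the "Total processed" count in the logs above. A small Python sketch of that arithmetic, where `seconds_per_item` is just an illustrative helper name:

```python
def seconds_per_item(total_seconds, items_processed):
    """Average wall-clock time spent per processed item."""
    return total_seconds / items_processed

# Figures taken from the Test C log output.
fts4 = seconds_per_item(569.902854, 9804)
old = seconds_per_item(538.760078, 9803)

print(f"FTS4:   {fts4:.3f} s/item")
print(f"0.14.5: {old:.3f} s/item")
print(f"Relative difference: {100 * (fts4 / old - 1):.1f}%")
```

This is only an average over the whole run, so it carries the same caveats about network load and extraction errors as the raw totals.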

--

If anyone has any comments, they're welcome. I plan to release 0.15.2 tomorrow with the FTS4 work and if there are no complaints, we may release a 0.16.0 in time for the GNOME 3.8 release.

--
Regards,
Martyn

Founder and CEO of Lanedo GmbH.


