Re: Interesting Post on Tracker



On 10/11/06, Joe Shaw <joeshaw novell com> wrote:
Hi,

On Wed, 2006-10-04 at 22:13 -0400, Kevin Kubasik wrote:
> I saw this post syndicated on Planet Gnome and thought that it deserves
> some attention. While it doesn't directly attack Beagle, its 'points' do
> seem to hit a little close to home on Beagle's weaker (er) spots. A good
> read.

Yes, the digs are thinly veiled.

> I do encourage comment and opinions back to the list (as opposed to on
> his blog) so we can try to learn from tracker and see what works and
> doesn't.

After I read this, I decided to give Tracker another try.  I could never
get it to build when it was using mysql, and it still didn't build out
of the box for me, but I was able to hack a few things and get it going.

Ditto, but the build is getting more attainable, although this means
using sqlite as the backend for pretty much everything... which
doesn't offer full-text indexing support and is not nearly as fast
(although I think they are trying to balance that out with QDBM
somehow).

There are a number of things in the post which seem to be exaggerated:

        * Mail indexing doesn't work at all; I got errors whenever I
        tried to turn on Evolution mail indexing.  Looking at the code,
        the current implementation is far too naive and if it did work
        wouldn't be useful.  The names of the mailboxes are currently
        hardcoded to (for Evo) "Inbox" and "Sent" and there is no
        support for anything but mboxes.  There isn't any logic to map a
        message to an Evolution-understandable URI, so it is not
        possible to open mail hits.  There is a lot of work to be done
        in this area.
I was able to index some mail, but Tracker always crashed soon
afterwards, and I could not get or use the mail results (even from
their new GUI).

        * I didn't strictly measure indexing time, but it didn't feel
        any faster indexing my data than Beagle does.  Until Tracker has
        more coverage of indexable data, this probably isn't a relevant
        or fair comparison.

        * The memory usage is great, but it's not at the 3mb level.
        While indexing for me, it seemed to hover around the 7-9mb
        level.  In any case, still quite a bit better than Beagle.
Agreed, Tracker will almost always take the cake here, but I don't
think that 10mb is a reasonable goal for us (or for anyone writing an
app of this size, really).  With a lot of new desktops shipping with
512mb to a gig of RAM by default, 30-40mb isn't unreasonable.  Granted,
smaller is nicer, and we should work to get the footprint down (and
keep it there), but I think a lot of people are getting more worked up
about memory usage than the average user is.

        * The API suggests that you can't search both the sqlite DB and
        the text index at the same time, which means that implementation
        details are pushed out onto the user, or at least onto a savvy
        programmer.  It doesn't seem possible to search for "eggplant
        veggie" where "eggplant" is in the text content and "veggie" is
        external metadata like a tag.
Which is not cool, as that's pretty complicated to do yourself and is
something you'd expect from the API; without it, the API is somewhat
worthless.  They are basically just writing a metadata store with a
mediocre text indexer. (i.e. I think that we should revisit using
Tracker as our metadata backend ;) )
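
To make the "eggplant veggie" case concrete, here is a minimal sketch of a query where each term may match either the text content or a user-supplied tag. It is plain Python, not Beagle's or Tracker's actual code: ToyIndex, its two dictionaries, and the file URIs are all hypothetical stand-ins for a text index and a metadata store.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory stand-in for a text index plus a metadata store."""

    def __init__(self):
        self.text_terms = defaultdict(set)  # word in content -> set of URIs
        self.tags = defaultdict(set)        # user-supplied tag -> set of URIs

    def add(self, uri, content, tags=()):
        for word in content.lower().split():
            self.text_terms[word].add(uri)
        for tag in tags:
            self.tags[tag.lower()].add(uri)

    def search(self, query):
        """AND all terms together; each term may match content or a tag."""
        result = None
        for term in query.lower().split():
            matches = self.text_terms.get(term, set()) | self.tags.get(term, set())
            result = matches if result is None else result & matches
        return sorted(result or [])


index = ToyIndex()
index.add("file:///recipes/parmigiana.txt", "slice the eggplant thinly", tags=["veggie"])
index.add("file:///recipes/steak.txt", "eggplant on the side", tags=["meat"])
print(index.search("eggplant veggie"))  # ['file:///recipes/parmigiana.txt']
```

The point is only that the caller issues one query and never needs to know which store a term lives in; a real implementation would have to do this merge across two separate backends.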

        * The Pango word breaking he references is commented out as
        being too slow.  Lucene already handles CJK word breaking.

Yay Lucene ;) I love it when we get to inherit awesome stuff from them!

        * The only stemmer provided is English.  The stemmer uses the
        same well-known Porter stemming algorithm that is already used
        inside Lucene.  Also, the license of the snowball stemmer
        appears to be old-style BSD so it would be incompatible with GPL
        applications.
Uh-Oh....

Other notes:

        * Using QDBM as the text indexer is an interesting idea.  It is
        a lot lower-level than Lucene and probably would not be
        well-suited to Beagle's use because we store documents rather
        than just an ID to look up in a database.  Because we need to search
        both text and metadata, a move to this system would be inefficient.
        It may make more sense to switch to something Lucene-like, such as
        Ferret, which is written in C and purportedly gives a
        performance improvement.

        * The benchmarks cited about QDBM are revised in a followup
        article, and the slowness of Lucene is often found to be due to
        JVM warmup time.
Yeah, saw that; since we keep the system warm, that isn't an issue.

        * Tracker is really well optimized for returning URIs.  The
        Beagle search APIs return a full "Hit" object which contains all
        the metadata for a document.  In certain cases you just want a
        URI and we should probably expose an API for that, which will be
        substantially faster.
That would be a cool addition.  Some simple poking by me reveals that
doing this in the C# bindings would be pretty easy; however, I'm not
the guy for libbeagle.  If anyone is interested, we can coordinate
and get some fun done.
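
A rough sketch of why a URI-only call could be cheaper than returning full hits, written in Python rather than C# and not based on the real Beagle or libbeagle API. Hit, ToyStore, search_uris, search_hits, and the property_loads counter are all hypothetical names; the counter simply stands in for the cost of loading each document's metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical full result: the URI plus every stored property."""
    uri: str
    properties: dict = field(default_factory=dict)


class ToyStore:
    def __init__(self):
        self._index = {}         # term -> set of URIs (cheap to consult)
        self._properties = {}    # URI -> metadata dict (treated as costly to load)
        self.property_loads = 0  # how many times full metadata was fetched

    def add(self, uri, text, **properties):
        for word in text.lower().split():
            self._index.setdefault(word, set()).add(uri)
        self._properties[uri] = properties

    def search_uris(self, term):
        """Return matching URIs only; never touches the metadata."""
        return sorted(self._index.get(term.lower(), set()))

    def search_hits(self, term):
        """Return full Hit objects, loading metadata for every match."""
        hits = []
        for uri in self.search_uris(term):
            self.property_loads += 1
            hits.append(Hit(uri, dict(self._properties[uri])))
        return hits


store = ToyStore()
store.add("file:///a.txt", "eggplant recipe", title="A", mime="text/plain")
store.add("file:///b.txt", "eggplant salad", title="B", mime="text/plain")

uris = store.search_uris("eggplant")
loads_after_uri_search = store.property_loads  # still 0
hits = store.search_hits("eggplant")
loads_after_hit_search = store.property_loads  # one load per match
```

A caller that only needs locations would use the first call and skip the per-document metadata work entirely.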

        * The low-level components in Lucene are pretty well-tested
        upstream in both the Java and .Net versions.  From a Beagle
        standpoint, however, we could do well to have comprehensive test
        suites.  We have some testing tools, but the whole area could use
        a lot of improvement.  For example, version 0.2.9 shipped with a
        nasty bug in which removal notifications weren't being sent to
        clients.  Despite my test runs, the tools didn't catch this.
Agreed.

        * There are still quite a few bugs; the daemon would just die
        with no error message or anything quite often in the middle of
        indexing.  It never made it fully through.
Yay Bugzilla! If only bug-buddy could attach to the daemon and
automagically get us logs/a Mono stack trace.

        * Tracker uses a lot of CPU.  I have a dual-CPU box so tracker's
        CPU usage was often above 100% and was pretty consistently at
        70%.  If it has throttling like Beagle, it doesn't work nearly
        as well.  On the other hand, I didn't have any documents that
        caused it to spin at 100% CPU like Beagle sometimes does.
I saw the same: Tracker slowed my system considerably and was intrusive
to the point where I had to close it while working.  I don't think its
throttling is anything to write home about.

        * My system got progressively slower as Tracker indexed.  I
        didn't investigate this, but when I returned to my machine after
        letting it index for a while, my system was noticeably slower; I
        was logging memory usage while it was running, however, and it
        never seemed to get out of control.  Not sure what is going on
        there.

Anyway, that's my rundown of things.  Basically Beagle's tasks are
unchanged: we need to rework the indexing to better handle user-supplied
metadata, we need to consolidate indexes into a fixed number rather than
one-per-backend to help reduce memory usage, and we need to focus on
fixing bugs in filters and backends so that our indexing process is more
robust.

We had a hackfest at the Boston GNOME summit with myself, Fredrik, Bera,
Daniel, and others.  I'll send a follow-up email about that.

Thanks,
Joe


Awesome to hear! Hope the hackfest went/goes well!


--
Cheers,
Kevin Kubasik
http://foio.blogspot.com


