Re: Last version of the new summary storage engine
- From: Philip Van Hoof <spam pvanhoof be>
- To: Dirk-Jan Binnema nokia com
- Cc: tinymail-devel-list gnome org
- Subject: Re: Last version of the new summary storage engine
- Date: Mon, 10 Dec 2007 14:12:55 +0100
Dirk-Jan Binnema nokia com wrote:
The mantra "using fewer memory, results in a faster
application" is very
true in our case.
> Sure -- my point was really that for any optimization we do, we
> can think of some reason why it would *really* make things a
> *lot* faster. But experience shows that it's very easy to fool
> ourselves. In other words, while the discussion *sounds*
> reasonable, I'm not going to believe it before I see it :)
Agree.
> So you're only taking full strings (i.e., the whole From: address
> or Subject:) and no substrings.
Indeed. I abandoned the substring idea as that would indeed take too
much time to calculate. Perhaps it's something for the future, in case
even this isn't memory-optimized enough. But really, I don't think
we'll have users who want 500,000 summary items on a phone before
phones ship with a gigabyte of RAM by default anyway.
You will most certainly have Ubuntu bugzilla maintainers claiming
they easily need a few hundred thousand items for showing their
typical IMAP folder on their phone ;-). I shouldn't say this, but
still: I don't think those people can be categorized as "the common
user with a common use-case".

In reality, the vast majority of people have folders of 1,000 to
10,000 items. We must indeed optimize for those numbers (and not for
the rare 100,000-item cases).
> Still, using hashtables adds some time. Suppose we have 1,000
> emails:
> 1) It will take extra time because of the hash table.
> 2) It might save some time because it uses less memory, cache etc.
> Depending on the number of duplicates, 1) or 2) will be dominant.
> If you don't have many duplicate strings, it's likely to only
> slow things down. The question is -- when does it become faster?
This is something to measure. In both tests, however, it turned out
to be faster. The largest gain comes from the fact that this summary
store writes in blocks of 1,000 items, rather than rewriting the
entire content each time. That's faster because summary content
rarely changes: we only write out the new information and preserve
the existing information.
> Well, this all depends on the number of times a string is
> duplicated. In many cases, there won't be that many. Let's be
> careful not to optimize for some unusual use case, while making
> the normal use case slower.
The normal use case is that summary information rarely changes. A
1,000-item block is only written when 1,000 new items are available.
Right now we have to rewrite all of the items no matter what (which
is what makes downloading a large folder of 15,000 items or more too
slow). Writing only one block and preserving the existing blocks
means writing just 1,000 items instead of all of them.

This makes the normal use case faster too, guaranteed. I could also
adapt the current solution to use fopen (path, "a") or appended
writing, but that will be difficult to make atomic (right now a
rename() is an easy way to make things atomic, while the original
file stays mapped until the very last moment, just before the
rename() takes place).
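The rename() trick, roughly (a sketch of the pattern, not the actual
tinymail code): write the new data to a temporary file next to the
real one, then rename() it over the original. The switch itself is
atomic, and a reader that still has the old file mapped keeps a
consistent view until it remaps:

#include <stdio.h>

/* Sketch only: atomically replace 'path' with 'len' bytes of
 * 'data' by writing a temp file and rename()'ing it in place. */
static int
atomic_replace (const char *path, const char *data, size_t len)
{
	char tmp[4096];
	FILE *file;

	snprintf (tmp, sizeof (tmp), "%s.tmp", path);

	file = fopen (tmp, "w");
	if (!file)
		return -1;

	if (fwrite (data, 1, len, file) != len) {
		fclose (file);
		remove (tmp);
		return -1;
	}

	fclose (file);

	/* POSIX guarantees rename() atomically replaces the target;
	 * existing mappings of the old file remain valid. */
	if (rename (tmp, path) != 0) {
		remove (tmp);
		return -1;
	}

	return 0;
}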
> > Ironically, we also write a smaller mapped file because we have
> > fewer strings to write (we made them unique): writing is slower
> > than searching for duplicates. And that was a test on a fast
> > hard disk, not even on an ultra-slow flash disk. Although,
> > indeed, a tablet's memory access will be slower than a PC's
> > memory access too (it's therefore not a guarantee, but we'll
> > test this of course).
> Flash disks are not that bad actually. And seeking is not as
> expensive.
> > I tested this with 100,000 items (with 50% duplicates, which is
> > very much like the Linux mailing list's folder): this was
> > actually faster than writing 50,000 items using the current
> > summary format.
> Could you try with a 'normal' folder? A little cmdline tool to
> show the number of duplicates could be quite nice...
Common use-case: because the "To" address will always be identical in
INBOX, you will always have ~25% duplicate strings (each item carries
~4 strings, and one of them is always the same). Add to that your
favorite "From" addresses, which will probably account for ~2% of the
duplicates, and thread-subject fields, another ~1% or ~2%. Spam will
be quite unique compared to legitimate e-mails, as even the "From"
field usually differs (the full-name part of it).
Of course.
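Something like the little tool you suggest could be as simple as this
throwaway sketch (the names are mine, for illustration): feed it one
summary string per line on stdin and it prints the duplicate ratio.

#include <glib.h>
#include <stdio.h>

int
main (void)
{
	GHashTable *seen = g_hash_table_new_full (g_str_hash, g_str_equal,
	                                          g_free, NULL);
	char line[8192];
	guint total = 0, dups = 0;

	while (fgets (line, sizeof (line), stdin)) {
		total++;
		if (g_hash_table_lookup (seen, line))
			dups++;
		else
			g_hash_table_insert (seen, g_strdup (line),
			                     GINT_TO_POINTER (1));
	}

	if (total > 0)
		printf ("%u of %u strings are duplicates (%.1f%%)\n",
		        dups, total, 100.0 * dups / total);

	g_hash_table_destroy (seen);

	return 0;
}

Piping, for example, the From: headers of an mbox through it (grep
'^From: ' INBOX | ./dupcount) would give a quick idea of how common
duplicates are in a normal folder.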
> Excellent. I hope the optimization works as you think, of course;
> I am just being constructively skeptical :)
Thanks for that.