Hi there fellow hackers,
This is some code I have lying around that will someday replace
the summary storage, probably a few weeks or days after Tinymail 1.0
gets released, in Tinymail 2.0's then-new branch.
I kindly invite all the crazy people to check it out, investigate,
comment on what could be better, etc.
I'll make a quick guide.
First, let's repeat the story of the summary:
o. The summary is the per-folder data that you want to see when
requesting an overview. This means: to, from, subject, cc, flags,
size, uid.
o. Because this data is quantitatively large, it consumes most of
your E-mail client's memory, unless you are smart. Tinymail tries
to be smart by mmap()ing this data (see the sketch right after
this list).
o. This data is read often, changes seldom and has a lot of duplicate
strings (really a lot). When it changes, it's either an append, a
delete or a flag change. Once appended, an item never changes apart
from flag changes or deletion.
o. Some numbers to give you an idea:
o. 30,000 items consume on average 10 MB of mmap()ed data (strings)
o. 6 MB of administration (pointers)
o. If not using GStringChunk, add 2 MB of heap administration to this
o. Evolution triples these numbers (if not more)
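As a side note, here's a minimal sketch of the mmap() trick referred
to above (not the actual Tinymail code; the function name and error
handling are mine). The point is that the strings live in the mapping,
so reading the summary costs page cache rather than heap:

#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const char *
map_summary (const char *path, size_t *len_out)
{
    struct stat st;
    const char *base;
    int fd = open (path, O_RDONLY);

    if (fd < 0)
        return NULL;

    if (fstat (fd, &st) < 0) {
        close (fd);
        return NULL;
    }

    /* The kernel pages the data in on demand; pointers into the
     * mapping reference string data without copying it to the heap. */
    base = mmap (NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close (fd);

    if (base == MAP_FAILED)
        return NULL;

    *len_out = st.st_size;
    return base;
}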
Then, let's discuss the requirements, problems, details and ideas:
o. The core idea is locality of the in-memory (and mmap()ed) data
o. Mmap is fine and all, but if your data is spread around, the
kernel must map many more pages into real RAM. By putting the
most referenced strings close together at the beginning of the
file, we make the kernel load fewer pages.
The aim of this is to reduce the VmRSS size.
o. Only unique strings are stored, saving disk space and therefore
also mmap size.
The aim of this is to reduce the VmSize.
o. Fewer pages that need to be accessed means fewer disk seeks.
o. Fewer pages (in RAM) that need to be accessed means fewer
operations on the data bus (mostly interesting for mobiles)
o. We'll need fewer writes of the summary data
o. Right now, rewriting the summary.mmap *IS* what makes Tinymail
slow when fetching a large folder (larger than 15,000 items,
you'll notice this). The solution is to work in blocks
instead.
o. Blocks (in this experimental code) are sized at 1,000 items.
This will always be fast, even on slow devices
o. The flags are put in a separate flat sequential file
o. Wipes just get marked; when a lot of items are wiped, a
rewrite of the block is scheduled (the only occasion for a
drastic rewrite). (A wipe is an expunge or vanish that got
locally synced.)
o. An append means that a new block is created, in appending mode
(new items that got added)
o. Searches don't consume the memory and the mmap of an entire folder
o. Thanks to the blocks, when a search returns summary items,
those items can hold a reference on just one block, instead
of needing to keep a reference on the entire folder's
summary mmap.
This makes it possible to do modest searches. Each hit only
keeps a block of 1,000 items loaded. If multiple hits occur
in one block, it's just one block with multiple references
in memory (see the sketch just below).
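To illustrate that reference idea, here's a sketch of per-block
reference counting (the names and layout are my assumptions, not
Tinymail's actual API): a search hit pins only the block that
contains it, and the block's mapping is released once the last hit
goes away.

#include <glib.h>
#include <sys/mman.h>

typedef struct {
    gint ref_count;
    gchar *mapped_data;   /* the block's mmap()ed data file */
    gsize mapped_len;
} SummaryBlock;

static SummaryBlock *
summary_block_ref (SummaryBlock *block)
{
    block->ref_count++;
    return block;
}

static void
summary_block_unref (SummaryBlock *block)
{
    /* Unmapping one block of 1,000 items frees its pages while the
     * rest of the folder's blocks stay untouched. */
    if (--block->ref_count == 0) {
        munmap (block->mapped_data, block->mapped_len);
        g_slice_free (SummaryBlock, block);
    }
}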
The solution: a three-file one.
Per block you have:
o. An index
o. A flags data file
o. A mmap file
The index contains records like:
4 uid0 10 2048 94 88 84 80
This means:
o. The uid is 4 bytes long
o. Then follow the 4 bytes of the uid itself
o. The sequence number is 10
o. The size of the E-mail is 2048 octets
o. The subject is at offset 94
o. The from is at offset 88
o. The to is at offset 84
o. The cc is at offset 80
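To make that layout concrete, here's a hypothetical C view of one
record after parsing it (the field widths and names are my assumptions
based on the description above, not necessarily the real types):

#include <glib.h>

typedef struct {
    guint32 uid_len;        /* length of the uid in bytes (here: 4)   */
    const gchar *uid;       /* the uid bytes, pointing into the index */
    guint32 seq;            /* sequence number (here: 10)             */
    guint32 size;           /* size of the E-mail in octets (2048)    */
    guint32 subject_offset; /* offset into the data file (here: 94)   */
    guint32 from_offset;    /* here: 88                               */
    guint32 to_offset;      /* here: 84                               */
    guint32 cc_offset;      /* here: 80                               */
} IndexRecord;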
The flags data file contains records like:
10 18910
This means:
The message with sequence number 10 has flags = 18910
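Because the flags file is flat and its records are fixed-size, a flag
change can be one small positioned write instead of a rewrite of the
whole summary. A sketch, assuming a fixed-width record (the real
on-disk format may differ):

#include <glib.h>
#include <stddef.h>
#include <unistd.h>

typedef struct {
    guint32 seq;    /* sequence number, e.g. 10  */
    guint32 flags;  /* the flag bits, e.g. 18910 */
} FlagRecord;

/* Overwrite the flags of the record at position 'pos' in place. */
static gboolean
flag_record_update (int fd, guint32 pos, guint32 flags)
{
    off_t where = (off_t) pos * sizeof (FlagRecord)
        + offsetof (FlagRecord, flags);

    return pwrite (fd, &flags, sizeof (flags), where) == sizeof (flags);
}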
The data file has \0 delimited strings. The nice thing about this file
is that the strings that are used most are put at the front of the
file (the file is sorted on usage). The index file's offsets are the
number of bytes from the start of this data file.
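Resolving an offset from the index to a string is then plain pointer
arithmetic on the mapping; no copying, no parsing:

#include <glib.h>

/* data_base is the mmap()ed data file. Offsets count from the start
 * of the file and the strings are \0 terminated, so the result can
 * be used as a normal C string without copying anything. */
static const gchar *
resolve_string (const gchar *data_base, guint32 offset)
{
    return data_base + offset;  /* e.g. data_base + 94 is the subject */
}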
Have fun reading code ...
--
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be
gnome: pvanhoof at gnome dot org
http://pvanhoof.be/blog
http://codeminded.be
Attachment: mytest3.tar.gz