Hi there fellow hackers,
This is some code I have lying around that will someday replace
the summary storage, probably a few weeks or days after Tinymail 1.0
gets released, in Tinymail 2.0's then-new branch.
I kindly invite all the crazy people to check it out, investigate,
comment on what could be better, etc.
I'll make a quick guide.
First, let's repeat the story of the summary:
o. The summary is the per-folder data that you want to see when
requesting an overview. This means: to, from, subject, cc, flags,
size, uid.
o. Because this data is quantitatively large, it consumes most of
your E-mail client's memory, unless you are smart. Tinymail tries
to be smart by mmap()ing this data (see the sketch right after
this list).
o. This data is read often, changes seldom and has a lot of duplicate
strings (really a lot). When it changes, it's either an append, a
delete or a flag change. Once appended, an item never changes apart
from flag changes or deletion.
o. Some numbers to give you an idea:
o. 30,000 items consume on average 10 MB of mmap()ed data (strings)
o. 6 MB of administration (pointers)
o. If not using GStringChunk, add 2 MB of heap administration to this
o. Evolution triples these numbers (if not more)
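As a side note, here's a minimal sketch of the mmap() trick referred
to above (not the actual Tinymail code; the function name and error
handling are mine). The point is that the strings live in the mapping,
so reading the summary costs page cache rather than heap:

#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const char *
map_summary (const char *path, size_t *len_out)
{
    struct stat st;
    const char *base;
    int fd = open (path, O_RDONLY);

    if (fd < 0)
        return NULL;

    if (fstat (fd, &st) < 0) {
        close (fd);
        return NULL;
    }

    /* The kernel pages the data in on demand; pointers into the
     * mapping reference string data without copying it to the heap. */
    base = mmap (NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close (fd);

    if (base == MAP_FAILED)
        return NULL;

    *len_out = st.st_size;
    return base;
}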
Then, let's discuss the requirements, problems, details and ideas:
o. The core idea is locality of the in-memory (and mmap()ed) data
o. Mmap is fine and all, but if your data is spread around, the
kernel must map many more pages into real RAM. By putting the
most referenced strings close together at the beginning of the
file, we make the kernel load fewer pages.
The aim of this is to reduce the VmRSS size.
o. Only unique strings are stored, saving disk space and therefore
also mmap size.
The aim of this is to reduce the VmSize.
o. Fewer pages that need to be accessed means fewer disk seeks.
o. Fewer pages (in RAM) that need to be accessed means fewer
operations on the data bus (mostly interesting for mobiles)
o. We'll need fewer writes of the summary data
o. Right now, rewriting the summary.mmap *IS* what makes Tinymail
slow when fetching a large folder (larger than 15,000 items,
you'll notice this). The solution is to work in blocks
instead.
o. Blocks (in this experimental code) are sized at 1,000 items.
This will always be fast, even on slow devices
o. The flags are put in a separate flat sequential file
o. Wipes just get marked; when a lot of items are wiped, a
rewrite of the block is scheduled (the only occasion for a
drastic rewrite). (A wipe is an expunge or vanish that got
locally synced.)
o. An append means that a new block is created, in appending mode
(new items that got added)
o. Searches don't consume the memory and the mmap of an entire folder
o. Thanks to the blocks, when a search returns summary items,
those items can hold a reference on just one block, instead
of needing to keep a reference on the entire folder's
summary mmap.
This makes it possible to do modest searches. Each hit only
keeps a block of 1,000 items loaded. If multiple hits occur
in one block, it's just one block with multiple references
in memory (see the sketch just below).
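To illustrate that reference idea, here's a sketch of per-block
reference counting (the names and layout are my assumptions, not
Tinymail's actual API): a search hit pins only the block that
contains it, and the block's mapping is released once the last hit
goes away.

#include <glib.h>
#include <sys/mman.h>

typedef struct {
    gint ref_count;
    gchar *mapped_data;   /* the block's mmap()ed data file */
    gsize mapped_len;
} SummaryBlock;

static SummaryBlock *
summary_block_ref (SummaryBlock *block)
{
    block->ref_count++;
    return block;
}

static void
summary_block_unref (SummaryBlock *block)
{
    /* Unmapping one block of 1,000 items frees its pages while the
     * rest of the folder's blocks stay untouched. */
    if (--block->ref_count == 0) {
        munmap (block->mapped_data, block->mapped_len);
        g_slice_free (SummaryBlock, block);
    }
}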
The solution: a three-file one.
Per block you have:
o. An index
o. A flags data file
o. A mmap file
The index contains records like:
4 uid0 10 2048 94 88 84 80
This means:
o. The uid is 4 bytes long
o. Then follow the 4 bytes of the uid itself
o. The sequence number is 10
o. The size of the E-mail is 2048 octets
o. The subject is at offset 94
o. The from is at offset 88
o. The to is at offset 84
o. The cc is at offset 80
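To make that layout concrete, here's a hypothetical C view of one
record after parsing it (the field widths and names are my assumptions
based on the description above, not necessarily the real types):

#include <glib.h>

typedef struct {
    guint32 uid_len;        /* length of the uid in bytes (here: 4)   */
    const gchar *uid;       /* the uid bytes, pointing into the index */
    guint32 seq;            /* sequence number (here: 10)             */
    guint32 size;           /* size of the E-mail in octets (2048)    */
    guint32 subject_offset; /* offset into the data file (here: 94)   */
    guint32 from_offset;    /* here: 88                               */
    guint32 to_offset;      /* here: 84                               */
    guint32 cc_offset;      /* here: 80                               */
} IndexRecord;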
The flags data file contains records like:
10 18910
This means:
The message with sequence number 10 has flags = 18910
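Because the flags file is flat and its records are fixed-size, a flag
change can be one small positioned write instead of a rewrite of the
whole summary. A sketch, assuming a fixed-width record (the real
on-disk format may differ):

#include <glib.h>
#include <stddef.h>
#include <unistd.h>

typedef struct {
    guint32 seq;    /* sequence number, e.g. 10  */
    guint32 flags;  /* the flag bits, e.g. 18910 */
} FlagRecord;

/* Overwrite the flags of the record at position 'pos' in place. */
static gboolean
flag_record_update (int fd, guint32 pos, guint32 flags)
{
    off_t where = (off_t) pos * sizeof (FlagRecord)
        + offsetof (FlagRecord, flags);

    return pwrite (fd, &flags, sizeof (flags), where) == sizeof (flags);
}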
The data file has \0 delimited strings. The nice thing about this file
is that the strings that are used most are put at the front of the
file (the file is sorted on usage). The index file's offsets are the
number of bytes from the start of this data file.
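Resolving an offset from the index to a string is then plain pointer
arithmetic on the mapping; no copying, no parsing:

#include <glib.h>

/* data_base is the mmap()ed data file. Offsets count from the start
 * of the file and the strings are \0 terminated, so the result can
 * be used as a normal C string without copying anything. */
static const gchar *
resolve_string (const gchar *data_base, guint32 offset)
{
    return data_base + offset;  /* e.g. data_base + 94 is the subject */
}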
Have fun reading code ...
--
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be
gnome: pvanhoof at gnome dot org
http://pvanhoof.be/blog
http://codeminded.be
Attachment: mytest3.tar.gz