Latest version of the new summary storage engine



This latest version of the future summary store uses a hashtable to
identify duplicate strings. As expected, this makes writing the summary
a lot faster: the hashtable hashes each added key string, so looking
for duplicates is much quicker than with my earlier, quickly written
doubly linked list code.
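
Roughly, the idea looks like this (a minimal sketch with hypothetical
names, not literally the code in the attached mytest5.tar.gz): each
string maps to the offset where it was first written, so later
occurrences just reuse that offset instead of being written again.

#include <glib.h>
#include <stdio.h>
#include <string.h>

typedef struct {
        GHashTable *seen;   /* string -> offset where it was written */
        FILE       *out;    /* the summaryblock file being written   */
        long        pos;    /* current write offset in that file     */
} BlockWriter;

static BlockWriter *
block_writer_new (FILE *out)
{
        BlockWriter *w = g_new0 (BlockWriter, 1);
        w->seen = g_hash_table_new_full (g_str_hash, g_str_equal,
                                         g_free, NULL);
        w->out = out;
        return w;
}

/* Write a string once and return its offset; if the same string was
 * written before, return the old offset without writing anything. */
static long
block_writer_add_string (BlockWriter *w, const gchar *str)
{
        gpointer val;

        /* O(1) hash lookup instead of walking a linked list of every
         * previously written string. */
        if (g_hash_table_lookup_extended (w->seen, str, NULL, &val))
                return (long) GPOINTER_TO_SIZE (val);

        long offset = w->pos;
        size_t len = strlen (str) + 1;          /* include the NUL */

        fwrite (str, 1, len, w->out);
        w->pos += (long) len;

        /* The key copy here is exactly the kind of string copying
         * mentioned below that I'd still like to avoid. */
        g_hash_table_insert (w->seen, g_strdup (str),
                             GSIZE_TO_POINTER ((gsize) offset));
        return offset;
}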

I'm still hoping to somehow avoid the g_strdup in the code that dumps
(writes) the summaryblock. Making copies of strings is not really fast.

I added the necessary locking. The reference counting of the
summaryblocks is not 100% correct yet, but the list below (and the
sketch after it) should give the idea:

  o. References are added not to individual summaryitems but to their
     container summaryblock.

  o. The initial load of a summary (which is a container for
     summaryblocks) creates all summaryitems, each of which adds one
     reference to its container summaryblock.

  o. Unloading a summary unloads each summaryblock, which merely means
     releasing the one reference held for each of its contained
     summaryitems.

  o. Requesting a summaryitem from the summary adds a reference to the
     container summaryblock of that item.

  o. Stopping usage of that summaryitem releases the reference on the
     container summaryblock.

  o. A summaryblock can live without its container summary
                    ***
  o. A summaryitem can't live without its container summaryblock
                   *****
  o. A summaryblock contains on average 1000 items

  o. A summary contains (Item-Amount / 1000) summaryblocks
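
In code, those rules look roughly like this (names are illustrative,
the real structures in the attachment differ):

#include <glib.h>

typedef struct {
        gint          ref_count;  /* one ref per live summaryitem or user */
        GMutex       *lock;       /* the locking mentioned above          */
        /* ... the block's mmap()ed data and item table ...               */
} SummaryBlock;

typedef struct {
        SummaryBlock *block;      /* the block this item lives in         */
        /* ... pointers into the block's mapped strings ...               */
} SummaryItem;

static SummaryBlock *
summary_block_ref (SummaryBlock *block)
{
        g_atomic_int_inc (&block->ref_count);
        return block;
}

static void
summary_block_unref (SummaryBlock *block)
{
        if (g_atomic_int_dec_and_test (&block->ref_count)) {
                /* Last reference gone: the block can be unmapped and
                 * freed, even if its container summary is already gone
                 * (a block can live without its summary). */
                /* munmap (...); g_free (block); */
        }
}

/* Requesting an item from the summary pins its container block ... */
static SummaryItem *
summary_item_get (SummaryBlock *block, guint index)
{
        SummaryItem *item = g_slice_new0 (SummaryItem);

        item->block = summary_block_ref (block);
        /* ... point item at the strings of entry `index` in the block ... */
        return item;
}

/* ... and when you stop using the item, that reference is dropped again
 * (an item cannot live without its block). */
static void
summary_item_put (SummaryItem *item)
{
        summary_block_unref (item->block);
        g_slice_free (SummaryItem, item);
}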


I'll repeat the core ideas of this summary store implementation:

  o. Localisation of mapped memory: often-needed strings are placed at
     the beginning of the mmap()ed file (strings are sorted by how often
     the summaryitems use them).

     -> The current summary storage scatters strings around no matter
        how often they are used (they are simply duplicated each time
        instead), resulting in a larger VmSize and in more mapping and
        swapping work for the kernel.

  o. Duplicate strings are made unique in the mmap.

     -> The current summary storage scatters duplicate strings around in
        the mapping, making the mapping more random (less localised
        memory access) and more redundant (larger VmSize).

  o. Blocks of 1000 items that can function individually (per block)

  o. Quick at appending new material: once the current block reaches
     1000 items it is written out and a new writable block is created
     (see the sketch after this list).

     -> The current summary storage needs to rewrite the entire
        summary.mmap file for every 5000th item that gets added. This
        gets hideously slow starting at around 15,000 items, especially
        on mobile ARM devices.

  o. Quick at changing the flags of items: a separate file per block is
     used to store the items' flags. Only this small sequential file
     needs to be rewritten when the flags of an item change.

     -> In the current summary storage this is equally slow once you
        reach 15,000 items.

  o. Fast searching: thanks to the localisation, often-requested strings
     will already be mapped into physical memory (by the kernel).

  o. Fast but low-memory browsing: the currently visible items will be
     mapped into physical memory (by the kernel), while items that were
     never requested aren't mapped in at all; they are not even touched
     at the initial scan stage.
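
To illustrate the append path and the per-block flags file, here is a
rough sketch (again with illustrative names, not the exact code from
the attachment):

#include <glib.h>
#include <glib/gstdio.h>
#include <stdio.h>

#define ITEMS_PER_BLOCK 1000

typedef struct {
        gchar  *path;                     /* e.g. "summary.block.7"    */
        guint   n_items;
        guint8  flags[ITEMS_PER_BLOCK];   /* one flags byte per item   */
        /* ... the strings of the items accumulated so far ...         */
} WritableBlock;

/* Changing flags only rewrites this tiny per-block file, never the
 * mmap()ed string data of the block itself. */
static void
writable_block_save_flags (WritableBlock *block)
{
        gchar *flags_path = g_strconcat (block->path, ".flags", NULL);
        FILE  *f = g_fopen (flags_path, "wb");

        if (f) {
                fwrite (block->flags, 1, block->n_items, f);
                fclose (f);
        }
        g_free (flags_path);
}

/* Appending never touches blocks that are already on disk: items go
 * into the current writable block until it holds 1000 of them, then
 * that block is written out once and a fresh block is started. */
static void
summary_append_item (GPtrArray *finished_blocks, WritableBlock **current,
                     guint8 flags)
{
        WritableBlock *block = *current;

        block->flags[block->n_items] = flags;
        /* ... append the item's strings to the block (deduplicated and
         *     sorted by usage count before the final write) ...        */
        block->n_items++;

        if (block->n_items == ITEMS_PER_BLOCK) {
                writable_block_save_flags (block);
                /* ... write the block's strings and offsets to disk ... */
                g_ptr_array_add (finished_blocks, block);
                /* A real implementation would also give the new block
                 * its own path here. */
                *current = g_slice_new0 (WritableBlock);
        }
}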



-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be



Attachment: mytest5.tar.gz
Description: application/compressed-tar


