Hi all,

I apologise for how long this has taken to send out - I had issues with the size of the results set (since I wanted to include January's results for comparison purposes too). As such, I ended up xz compressing a tarball since that was half the size of a max-compression zip file. Apologies to those trapped on platforms which can't do .tar.xz.

So, without further ado...

Thank you all for submitting results of performance testing. It has been both frustrating and rewarding to go through them. For the most part they all show the same thing: the changes we've made since the gathering really have improved the overall runtime of the tool, and excluding things like the YAML cache, which no longer exists, the distribution of time among the various parts of the codebase, particularly pre-scheduler, seems fairly similar, just smaller, indicating a pervasive improvement in implementation without large architectural changes.

Next time we call for profiling, we'll change up the tests a bit, to introduce other aspects such as sourcecache. Before we do that though, we'll look to automate things a little more so that you all don't have to follow a complex manual process.

The scheduler runtime improvements are largely thanks to Tristan's work earlier this year, which improved matters drastically. Sadly the results also highlight various parts of the tool which we were already somewhat aware of as performance issues, and which are now glaringly obviously in need of work. I am unable to identify any particular small things to improve, and as such I am going to raise only the larger architectural issues which we now face. I will attach the results to this email, so that when you go through them you can raise any smaller things you spot which I may have missed while staring at the big jobs. Wood-for-trees and all that jazz.

UI rendering
------------

We already have a nice short-circuit in place so that if stdout/stderr are not terminals, we do not do the fancy UI render. However when we do render, it's worth noting that around 50% of the straight-line cost of the parent `bst` process is spent rendering and displaying UI widgets.

There are three ways I can think of to potentially reduce the cost here. One is simply to make the rendering less complex; a second is to introduce a maximum update rate for the UI, so that we render fewer times; and the third (and probably the most likely to be generically successful) is to outsource the UI to a subprocess, managed via the scheduler, to which we send the UI events in a non-blocking manner. (I've put a very rough sketch of that last idea after the next section.)

Forking job processes
---------------------

As raised at the gathering, and now even more obviously a problem, the cost of the job sub-processes as straight-line time in the parent process is becoming an issue. While I accept that in local build scenarios the parent process is unlikely to account for a significant amount of the total runtime, in remote-execution environments, where ccache-equivalents might have a lot of the work pre-cached, it is a potentially significant part of the time. 10% of the straight-line runtime of the parent `bst` process is spent in the low levels of the Python interpreter running `fork()`. Any way to reduce this percentage would be super-useful. While this percentage does not seem particularly high, it's worth noting that if we can eliminate the UI costs in the direct path, the percentage rises to around 25% of runtime.
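One avenue that might be worth measuring here - purely an illustration, not a claim that it fits our job model - is the stdlib's "forkserver" start method, so that job processes are forked from a small helper process rather than from a parent holding the whole pipeline in memory. Job arguments would then have to be picklable, so the semantics change and this would need real benchmarking before we trust it:

    # Hedged sketch: fork job processes from multiprocessing's small
    # "forkserver" process instead of the large parent.  The job payload
    # here is hypothetical, not BuildStream's real job API.
    import multiprocessing


    def run_job(job_name):
        # Stand-in for the real work a job process would do.
        print(f"running {job_name}")


    if __name__ == "__main__":
        ctx = multiprocessing.get_context("forkserver")
        procs = [ctx.Process(target=run_job, args=(f"job-{i}",)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()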
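And coming back to the UI rendering point above, the subprocess idea could look something like the following. This is only a sketch with made-up names (`ui_renderer`, the event strings, the queue size); the real thing would need to hook into the scheduler's lifecycle. The important property is that the parent pushes events with `put_nowait()` and never blocks on the UI:

    # Hedged sketch of a UI renderer living in a separate process, fed via a
    # bounded queue so the parent never waits on rendering.
    import queue
    import multiprocessing


    def ui_renderer(events):
        # Drain events and (in the real tool) redraw the status widgets.
        while True:
            event = events.get()
            if event is None:          # sentinel: shut down
                break
            print(f"[ui] {event}")


    if __name__ == "__main__":
        events = multiprocessing.Queue(maxsize=1024)
        ui = multiprocessing.Process(target=ui_renderer, args=(events,))
        ui.start()

        for i in range(10):
            try:
                # Non-blocking: if the queue is full we simply drop this
                # update and let a later one supersede it.
                events.put_nowait(f"job {i} finished")
            except queue.Full:
                pass

        events.put(None)               # ask the renderer to exit
        ui.join()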
Fixing of element state querying
--------------------------------

As an example of a smaller but potentially still valuable optimisation, we query the cached status of elements an awful lot. For example, on my profile of the build, we call `Element._cached_success()` a total of 1104441 times for a 6198 element pipeline. That's an average of 178 times per element. While this isn't theoretically *that* expensive (a mere 4% of runtime, at least until the above issues are improved), it is a clear example of where we might have some architectural issues within BuildStream's handling of element progress through the various states. I'm not sure what to suggest here, other than "have more thoughts as a group please".

Caching of element construction
-------------------------------

Even in the YAML New World Order, `Element._new_from_meta()` remains a significant contributor to our runtime, totalling around 30% of runtime for the `load-selection` profile. We do an awful lot of work here which in theory could be traced and then cached.

With the YAML-NWO we removed the YAML cache because the new parser outperformed the old cache. However, it has opened the way to thinking about migrating some of the logic of `Element.__init__()` and `Source.__init__()` into the loader, with a view to being able to cache constructed `Variables` instances, resolved public data, etc., so that we can simply avoid doing that work when we already have it cached. My view is that around 25% of runtime in the `load-selection` profile is work which could be cached. Given we know that Python's pickle module isn't the fastest of beasts, we should think *very* carefully about caching methods, but at minimum we have the potential for significant gains here.

I would recommend starting some research into how we might migrate the logic out of `Element` and/or `Source` into `LoadElement/MetaElement` and/or `MetaSource`, such that we can determine just how much of the work we might be able to cache for later re-use. As part of this, we'd likely end up re-adding a cache of the YAML parse, simply because it'll be necessary for preserving provenance data.

I've appended a couple of very rough sketches below my signature, in case they help seed the discussion on these last two points.

Thanks for reading, I look forward to discussion with you all...

D.

--
Daniel Silverstone                          https://www.codethink.co.uk/
Solutions Architect               GPG 4096/R Key Id: 3CCE BABE 206C 3B69
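P.S. On the state-querying point, this is the sort of memoisation I have in mind, names invented for illustration; the hard part in practice is deciding exactly when to invalidate:

    # Hedged sketch: memoise the "is this element's build result cached?"
    # query and only recompute it when something signals a state change.
    # Class, method and attribute names are illustrative, not real API.
    class ElementState:
        def __init__(self, key):
            self._key = key
            self._cached_success = None    # None means "not yet computed"

        def cached_success(self):
            if self._cached_success is None:
                self._cached_success = self._query_artifact_cache()
            return self._cached_success

        def invalidate(self):
            # Call this when a build job finishes or the artifact cache
            # changes, instead of re-querying on every read.
            self._cached_success = None

        def _query_artifact_cache(self):
            # Stand-in for the real (expensive) cache lookup.
            return False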
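And on the caching-of-construction point, the rough shape might be something like the below: an on-disk cache keyed by a digest of the inputs. The key derivation, the cache location and the choice of pickle are all placeholders, not proposals; whether pickle is fast and safe enough is exactly the research question. Usage would be along the lines of `variables = cached_construct(cache_dir, inputs, build_variables)` for some hypothetical `build_variables` constructor:

    # Hedged sketch: cache an expensive-to-construct object on disk, keyed
    # by a digest of everything it was built from.
    import hashlib
    import os
    import pickle


    def cached_construct(cache_dir, inputs, construct):
        # 'inputs' must deterministically describe everything 'construct'
        # depends on (variables, public data, provenance, ...).
        key = hashlib.sha256(repr(sorted(inputs.items())).encode()).hexdigest()
        path = os.path.join(cache_dir, key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        obj = construct(inputs)
        os.makedirs(cache_dir, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
        return obj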
Attachment:
performance-results-jan-and-apr.tar.xz
Description: application/xz