Hi all,

I apologise for how long this has taken to send out - I had issues with the size of the results set (since I wanted to include January's results for comparison purposes too). As such, I ended up xz compressing a tarball since that was half the size of a max-compression zip file. Apologies to those trapped on platforms which can't do .tar.xz.

So, without further ado...

Thank you all for submitting results of performance testing. It has been both frustrating and rewarding to go through them. For the most part they all show the same thing: the changes we've made since the gathering really have improved the overall runtime of the tool, and excluding things like the YAML cache, which no longer exists, the distribution of time among the various parts of the codebase, particularly pre-scheduler, seems fairly similar, just smaller, indicating a pervasive improvement in implementation without large architectural changes.

Next time we call for profiling, we'll change up the tests a bit, to introduce other aspects such as sourcecache. Before we do that though, we'll look to automate things a little more so that you all don't have to follow a complex manual process.

The scheduler runtime improvements are largely thanks to Tristan's work earlier this year, which improved matters drastically. Sadly the results also highlight various parts of the tool which we were already somewhat aware of as performance issues, and which are now glaringly obviously in need of work. I am unable to identify any particular small things to improve, and as such I am going to raise only the larger architectural issues which we now face. I will attach the results to this email, so that when you go through them you can raise any smaller things you spot which I may have missed while staring at the big jobs. Wood-for-trees and all that jazz.

UI rendering
------------

We already have a nice short-circuit in place so that if stdout/stderr are not terminals, we do not do the fancy UI render. However when we do render, it's worth noting that around 50% of the straight-line cost of the parent `bst` process is spent rendering and displaying UI widgets.

There are three ways I can think of to potentially reduce the cost here. One is simply to make the rendering less complex; a second is to introduce a maximum update rate for the UI, so that we render fewer times; and the third (and probably the most likely to be generically successful) is to outsource the UI to a subprocess, managed via the scheduler, to which we send the UI events in a non-blocking manner. (I've put a very rough sketch of that last idea after the next section.)

Forking job processes
---------------------

As raised at the gathering, and now even more obviously a problem, the cost of the job sub-processes as straight-line time in the parent process is becoming an issue. While I accept that in local build scenarios the parent process is unlikely to account for a significant amount of the total runtime, in remote-execution environments, where ccache-equivalents might have a lot of the work pre-cached, it is a potentially significant part of the time. 10% of the straight-line runtime of the parent `bst` process is spent in the low levels of the Python interpreter running `fork()`. Any way to reduce this percentage would be super-useful. While this percentage does not seem particularly high, it's worth noting that if we can eliminate the UI costs in the direct path, the percentage rises to around 25% of runtime.
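One avenue that might be worth measuring here - purely an illustration, not a claim that it fits our job model - is the stdlib's "forkserver" start method, so that job processes are forked from a small helper process rather than from a parent holding the whole pipeline in memory. Job arguments would then have to be picklable, so the semantics change and this would need real benchmarking before we trust it:

    # Hedged sketch: fork job processes from multiprocessing's small
    # "forkserver" process instead of the large parent.  The job payload
    # here is hypothetical, not BuildStream's real job API.
    import multiprocessing


    def run_job(job_name):
        # Stand-in for the real work a job process would do.
        print(f"running {job_name}")


    if __name__ == "__main__":
        ctx = multiprocessing.get_context("forkserver")
        procs = [ctx.Process(target=run_job, args=(f"job-{i}",)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()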
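And coming back to the UI rendering point above, the subprocess idea could look something like the following. This is only a sketch with made-up names (`ui_renderer`, the event strings, the queue size); the real thing would need to hook into the scheduler's lifecycle. The important property is that the parent pushes events with `put_nowait()` and never blocks on the UI:

    # Hedged sketch of a UI renderer living in a separate process, fed via a
    # bounded queue so the parent never waits on rendering.
    import queue
    import multiprocessing


    def ui_renderer(events):
        # Drain events and (in the real tool) redraw the status widgets.
        while True:
            event = events.get()
            if event is None:          # sentinel: shut down
                break
            print(f"[ui] {event}")


    if __name__ == "__main__":
        events = multiprocessing.Queue(maxsize=1024)
        ui = multiprocessing.Process(target=ui_renderer, args=(events,))
        ui.start()

        for i in range(10):
            try:
                # Non-blocking: if the queue is full we simply drop this
                # update and let a later one supersede it.
                events.put_nowait(f"job {i} finished")
            except queue.Full:
                pass

        events.put(None)               # ask the renderer to exit
        ui.join()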
Fixing of element state querying
--------------------------------

As an example of a smaller but potentially still valuable optimisation, we query the cached status of elements an awful lot. For example, on my profile of the build, we call `Element._cached_success()` a total of 1104441 times for a 6198 element pipeline. That's an average of 178 times per element. While this isn't theoretically *that* expensive (a mere 4% of runtime, at least until the above issues are improved), it is a clear example of where we might have some architectural issues within BuildStream's handling of element progress through the various states. I'm not sure what to suggest here, other than "have more thoughts as a group please".

Caching of element construction
-------------------------------

Even in the YAML New World Order, `Element._new_from_meta()` remains a significant contributor to our runtime, totalling around 30% of runtime for the `load-selection` profile. We do an awful lot of work here which in theory could be traced and then cached.

With the YAML-NWO we removed the YAML cache because the new parser outperformed the old cache. However, it has opened the way to thinking about migrating some of the logic of `Element.__init__()` and `Source.__init__()` into the loader, with a view to being able to cache constructed `Variables` instances, resolved public data, etc., so that we can simply avoid doing that work when we already have it cached. My view is that around 25% of runtime in the `load-selection` profile is work which could be cached. Given we know that Python's pickle module isn't the fastest of beasts, we should think *very* carefully about caching methods, but at minimum we have the potential for significant gains here.

I would recommend starting some research into how we might migrate the logic out of `Element` and/or `Source` into `LoadElement/MetaElement` and/or `MetaSource`, such that we can determine just how much of the work we might be able to cache for later re-use. As part of this, we'd likely end up re-adding a cache of the YAML parse, simply because it'll be necessary for preserving provenance data.

I've appended a couple of very rough sketches below my signature, in case they help seed the discussion on these last two points.

Thanks for reading, I look forward to discussion with you all...

D.

--
Daniel Silverstone                          https://www.codethink.co.uk/
Solutions Architect               GPG 4096/R Key Id: 3CCE BABE 206C 3B69
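P.S. On the state-querying point, this is the sort of memoisation I have in mind, names invented for illustration; the hard part in practice is deciding exactly when to invalidate:

    # Hedged sketch: memoise the "is this element's build result cached?"
    # query and only recompute it when something signals a state change.
    # Class, method and attribute names are illustrative, not real API.
    class ElementState:
        def __init__(self, key):
            self._key = key
            self._cached_success = None    # None means "not yet computed"

        def cached_success(self):
            if self._cached_success is None:
                self._cached_success = self._query_artifact_cache()
            return self._cached_success

        def invalidate(self):
            # Call this when a build job finishes or the artifact cache
            # changes, instead of re-querying on every read.
            self._cached_success = None

        def _query_artifact_cache(self):
            # Stand-in for the real (expensive) cache lookup.
            return False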
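And on the caching-of-construction point, the rough shape might be something like the below: an on-disk cache keyed by a digest of the inputs. The key derivation, the cache location and the choice of pickle are all placeholders, not proposals; whether pickle is fast and safe enough is exactly the research question. Usage would be along the lines of `variables = cached_construct(cache_dir, inputs, build_variables)` for some hypothetical `build_variables` constructor:

    # Hedged sketch: cache an expensive-to-construct object on disk, keyed
    # by a digest of everything it was built from.
    import hashlib
    import os
    import pickle


    def cached_construct(cache_dir, inputs, construct):
        # 'inputs' must deterministically describe everything 'construct'
        # depends on (variables, public data, provenance, ...).
        key = hashlib.sha256(repr(sorted(inputs.items())).encode()).hexdigest()
        path = os.path.join(cache_dir, key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        obj = construct(inputs)
        os.makedirs(cache_dir, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
        return obj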
Attachment:
performance-results-jan-and-apr.tar.xz
Description: application/xz