Re: Automated benchmarks of BuildStream



On Fri, 2018-03-23 at 10:08 +0000, Jim MacArthur wrote:
> [...]
> The goal I have in my mind is to have a directory produced by the GitLab
> CI system which contains various artifacts of the benchmarking process.
> JSON output from both benchmarks and log scraping will be the primary
> form of data, and we'll also be able to produce visual indicators in
> HTML and SVG or PNG, along with nicely-formatted tables, and CSV format
> for pasting into a spreadsheet to do your own ad-hoc analysis.

Hi Jim,

So I'm sorry if my recent reply here sounded "short"; a lot of this
benchmarking initiative has happened out of sight, and as such it's
impossible for me to know how much effort has been invested in it so
far.

I feel that, since its inception, the benchmarking effort has veered
off course at least once, as Sam and I did not see exactly eye to eye
on this from the beginning. It worries me if a lot of effort is being
spent without us necessarily being on the same page; I think we fixed
that in our discussions at the hackfest before FOSDEM, so let's make
sure we remain on the same page.

First, here is the material I was able to gather on the subject:

  Angelos's original email in November, which appears to be a reply
  to a message of mine that I can no longer find:
  https://mail.gnome.org/archives/buildstream-list/2017-November/msg00001.html

  It's worth reading through the above thread, but here are some of my
  replies in that thread regardless:
  https://mail.gnome.org/archives/buildstream-list/2017-November/msg00005.html
  https://mail.gnome.org/archives/buildstream-list/2017-November/msg00017.html

  Sam's announcement of the beginnings of the benchmarks repo:
  https://mail.gnome.org/archives/buildstream-list/2018-February/msg00012.html

  A flagship issue in the buildstream repo:
  https://gitlab.com/BuildStream/buildstream/issues/205

  And a README in the benchmarks repo:
  https://gitlab.com/BuildStream/benchmarks


Having re-read the above, I *think* we are *mostly* on the same page
here, regarding:

  o This is something standalone that a developer can:
    - Run on their laptop
    - Render and view the results
    - Select which parts of the benchmarks they want to run, for
      quicker observation of the impacts of their code changes

  o Leveraging BuildStream's logging, which already records the timing
    of the "things" we would want to analyze in benchmarks, in order to
    reduce the observer effect (see the log-scraping sketch below).
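
As an illustration of what I mean by leveraging the logs, here is a
minimal sketch in Python of the kind of scraping I have in mind; the
elapsed-time stamp format used here is only an assumption for the sake
of the example, not necessarily BuildStream's actual log format:

  import re
  from datetime import timedelta

  # Hypothetical log line: "[00:01:02.345] [...] Loading elements"
  STAMP = re.compile(r'^\[(\d+):(\d+):(\d+)\.(\d+)\]')

  def elapsed(line):
      """Parse the leading elapsed-time stamp of a log line, if any."""
      match = STAMP.match(line)
      if not match:
          return None
      hours, minutes, seconds, millis = (int(g) for g in match.groups())
      return timedelta(hours=hours, minutes=minutes,
                       seconds=seconds, milliseconds=millis)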

So far so good. One thing I'm more concerned about is what exactly we
are measuring, and how we set about measuring it; what exactly do we
want to be observing?

When we say "I want to observe the time it takes to stage files" or
"I want to observe the time it takes to load .bst files", what I really
want is to observe the *time per record* for each "thing" we benchmark,
to observe whether we handle it in linear time or not, and to compare
that across versions of BuildStream.

When I see in the above linked README file:

  "Configurable aspects:

     * Scale of generated projects, e.g. 1 file, 10 files, 100 files...
       lots of data points allow analyzing how a feature scales, but
       also means we have lots of data.
     ...
  "

This raises a flag for me. Rather, I am interested in seeing the
results of every run of N items, with N incrementing, in one graph,
and this really should be the default (if it is configurable, it leads
me to suspect we cannot observe non-linear operations within a single
run of the benchmarks).

That said, configuring an upper bound on the sizes we want to test (or
a list of numbers of records) is interesting, so that we avoid
*requiring* that a developer run the benchmarks for hours and hours.
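
For instance, the default list of record counts could be as simple as
a geometric series capped by a configurable upper bound; a sketch, with
hypothetical names:

  def record_counts(upper_bound=10000):
      """Default sample sizes: 1, 10, 100, ... up to the upper bound."""
      n = 1
      while n <= upper_bound:
          yield n
          n *= 10

So list(record_counts(1000)) would give 1, 10, 100 and 1000, and a
developer in a hurry could lower the bound without losing the shape of
the curve.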

Ultimately, what I want to see for a given "thing" that we measure, be
it loading a .bst file, staging a file into a sandbox, caching an
artifact, or whatnot, is always a time per record.

Allow me some ascii art here to illustrate more clearly what I am
hoping to see:

                        Loading bst files
 40ms +------------------------------------------------------------+
      |                                                            |
      |                                                            |
 20ms |                                                            |
      |                                                            |
      |                                                            |
 10ms |                                                     o      |
      |                                       o                    |
      |   o          o           o                                 |
  0ms +------------------------------------------------------------+
          |          |           |            |             |
      (1 file)  (10 files)  (100 files) (1,000 files) (10,000 files)


In the above, we would have lines connecting the dots; probably due to
the recursive operations we need to run for circular dependency
detection, and the presorting of dependencies on each element, this
function will most likely be non-linear, but we would ultimately want
to make it linear.

Each sample represented here is the "time it took to load the whole
project, divided by the number of elements being loaded", where the
"time to load the project" is isolated: it does not include python
startup time or the time it takes to, say, display the pipeline in
`bst show` (so we need to use log parsing to isolate these things).
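
Once that isolated load time is scraped out of the logs, computing the
data points for the graph is trivial; a sketch, assuming the results
are available as (number of elements, isolated load seconds) pairs:

  # 'results' is a hypothetical list of (num_elements, load_seconds)
  # pairs produced by scraping the benchmark logs.
  def per_record_samples(results):
      """Yield (num_elements, seconds-per-element) points to plot."""
      for num_elements, load_seconds in results:
          yield num_elements, load_seconds / num_elements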

We can have multiple versions of BuildStream rendered into the same
graph above, with a legend indicating which color corresponds to which
version of BuildStream being sampled, so we can easily see how a code
change has affected the performance of a given "thing".
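
As a sketch of the rendering side, assuming the (number of elements,
seconds per element) samples from above; matplotlib is only one
possible renderer here, not a requirement:

  import matplotlib.pyplot as plt

  # 'versions' maps a BuildStream version string to a list of
  # (num_elements, seconds_per_element) samples -- layout assumed here.
  def plot_versions(versions, title="Loading bst files"):
      for version, points in sorted(versions.items()):
          xs, ys = zip(*points)
          plt.plot(xs, ys, marker='o', label=version)
      plt.xscale('log')
      plt.xlabel('number of elements')
      plt.ylabel('seconds per element')
      plt.title(title)
      plt.legend()
      plt.savefig('load-times.svg')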

If we introduce randomization of data sets here, which may be
important for generating more realistic data sets (for instance, it is
not meaningful to run the above test on a single target which directly
depends on 10,000 bst files; we need some realistic "depth" of
dependencies), then it becomes important to rerun the same sample of,
say, "10 files" many times (with different randomized datasets) and
observe the average of those totals.
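
A sketch of what that averaging could look like; generate_project()
and measure_load() are hypothetical stand-ins for the benchmark's
project generator and its log-scraping measurement, not existing
functions:

  from statistics import mean

  def averaged_sample(num_elements, repeats=10, seed=0):
      """Average seconds-per-element over several randomized projects.

      generate_project() and measure_load() are hypothetical helpers
      standing in for the project generator and the log scraper.
      """
      totals = []
      for i in range(repeats):
          project = generate_project(num_elements, seed=seed + i)
          totals.append(measure_load(project) / num_elements)
      return mean(totals)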

In the future, we can also extend this to plot separate graphs for
memory consumption; however, for accurate readings, we will need to
extend BuildStream's logging Message objects so that memory consumption
snapshots can optionally be observed and reported at the right places.
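
For instance, such a snapshot could be as simple as the process's peak
RSS via the standard resource module; this is only a sketch of the
kind of datum a Message extension could carry, nothing of the sort
exists in BuildStream today:

  import resource

  def memory_snapshot():
      """Peak RSS of the current process, in kilobytes on Linux."""
      return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss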

At a high level, it's important to keep in mind that we use benchmarks
to identify bottlenecks (and then later use profiling to inspect the
identified bottlenecks and optimize them).
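
For completeness, the profiling step that follows would be ordinary
Python profiling of whatever callable the benchmarks flagged; a sketch
using the standard cProfile module, where load_func is a hypothetical
stand-in for the bottleneck being inspected:

  import cProfile
  import pstats

  def profile_load(load_func, *args):
      """Profile one invocation of a flagged callable into 'load.prof'."""
      cProfile.runctx('load_func(*args)', globals(), locals(), 'load.prof')
      pstats.Stats('load.prof').sort_stats('cumulative').print_stats(20)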

While it can be interesting in some way to observe something simple
like "the time it took to complete a well known build" of something,
such numbers are not useful for identifying bottlenecks and improving
things; they can only act as a global monitor of how well, or how
badly, we perform.


In closing: I suspect that we are mostly all on the same page as to
what we are doing with the benchmarks initiative, but since it seems I
have had a hard time communicating this in the past, and since I have
not had feedback in a long time and cannot gauge the effort being
spent here, I just feel that we have to make sure, once more, that we
are really still on the same page.

Cheers,
    -Tristan


