Re: Proposal: JSON for artifact metadata instead of YAML



On Tue, 2017-10-31 at 12:39 +0000, Angelos Evripiotis wrote:
> Hi!
>
> I'd like to propose that we use the Python library json module for saving and
> loading artifact metadata, instead of BuildStream's _yaml module.

Hi Angelos,

Needless to say, I dislike this very, very much.

Don't consider this as a flat out "No", but I do think this is
premature. I feel that before we make such an undesirable change we
should evaluate it properly - some of this proposal is founded on
speculation, and we should think more before we make this trade-off.

In advance of my verbose replies, I'm going to propose to you an
alternative and, IMO, more productive course of action - you obviously
take performance very seriously, and I would like to as well, so let's
do it properly.

To this end, I suggest that you do the following:

  o Collect some use cases where performance is critical - I mean use
    cases that are visible from the BuildStream CLI frontend.

    E.g. `bst build`, `bst checkout`, etc

    What about cases where `bst build` is using import elements vs
    compose elements, etc. ?

  o Create a benchmark test suite based on these frontend toplevel
    cases, not based on internal operations.

    I would like to see graphs rendered from collected data points,
    using simulated projects:

      o What is the "per element" build time for a simulated project
        with only import elements ?

        What does a graph look like when simulating one element vs
        500 elements vs 1000 elements vs 10,000 elements, etc. ?

      o How does the above graph compare with an import element which
        stages 1 file, vs 10, 100, 10,000, 100,000 files ?

    Further, we should re-run the whole benchmark suite for every
    publicly released BuildStream version over time. How does 1.0
    compare to 1.1, 1.2, etc. - have we regressed or improved
    performance ?

    Let's really know about it.
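
To be concrete, below is a rough, untested sketch of what a single data
point in such a suite could look like - it generates a throwaway project
of N import elements and times a full `bst build` from the frontend. The
project layout and element contents here are purely illustrative
assumptions; a real suite would parametrize file counts, element kinds
and BuildStream versions as described above.

  import os
  import subprocess
  import tempfile
  import time

  def generate_project(directory, num_elements):
      # Minimal project.conf and one import element per generated file
      # (the exact layout here is an assumption, adjust to the real format)
      with open(os.path.join(directory, "project.conf"), "w") as f:
          f.write("name: benchmark\n")
          f.write("element-path: elements\n")
      os.makedirs(os.path.join(directory, "files"))
      os.makedirs(os.path.join(directory, "elements"))
      for i in range(num_elements):
          with open(os.path.join(directory, "files", "file-%d" % i), "w") as f:
              f.write("%d\n" % i)
          with open(os.path.join(directory, "elements", "element-%d.bst" % i), "w") as f:
              f.write("kind: import\n"
                      "sources:\n"
                      "- kind: local\n"
                      "  path: files/file-%d\n" % i)
      return ["element-%d.bst" % i for i in range(num_elements)]

  def per_element_build_time(num_elements):
      # Time the whole frontend session, not an internal operation
      with tempfile.TemporaryDirectory() as directory:
          targets = generate_project(directory, num_elements)
          start = time.time()
          subprocess.check_call(["bst", "build"] + targets, cwd=directory)
          return (time.time() - start) / num_elements

  for n in (1, 500, 1000):
      print(n, per_element_build_time(n))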


Comments in line below...

> I think we should do this mostly for performance and a bit for security.
>
> Here 'artifact metadata' means the data that is part of a build artifact, and
> stored in the 'meta' directory, e.g. all these files:
>
>   ~/.cache/buildstream/artifacts/extract/project/element/22ba6/meta/*.yaml
>
> I think _yaml is very good for configuration, i.e. .bst files and
> project.conf, and am not suggesting that we change that.
>
> As far as I can tell, JSON supports the current and proposed uses for artifact
> metadata. If adopted, I imagine we'd have some period of backwards
> compatibility before dropping support for the old .yaml metadata files (pre
> v1). I also volunteer to undertake the work to do it before v1 (this week?).

Backwards compatibility is not a concern, at all. Artifact format is
not stable for 1.0, nor is there even any public entry point in the 1.0
API for extracting a full artifact and looking at the metadata.

If people go poking into the private buildstream caches and things
break, that is entirely their own problem.

For the duration of the 1.1 dev cycle at *least*, we are free to change
the artifact format as frequently as we want. This is simply a matter of
bumping the core artifact version, which ensures that new cache keys are
created and BuildStream won't ever load an artifact it doesn't
understand - simple as that.
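
To illustrate the mechanism (this is just the shape of the idea, not the
actual BuildStream code): cache keys are derived from a hash over an
element's inputs together with a global core artifact version, so
bumping that one constant changes every key and old-format artifacts
simply never get looked up again.

  import hashlib

  CORE_ARTIFACT_VERSION = 4   # hypothetical value of the global version

  def cache_key(element_inputs):
      # Any bump of CORE_ARTIFACT_VERSION changes every key, so artifacts
      # produced with an older artifact format are never loaded again.
      payload = "%d:%s" % (CORE_ARTIFACT_VERSION, sorted(element_inputs.items()))
      return hashlib.sha256(payload.encode("utf-8")).hexdigest()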


> Here are the considerations I made.
>
> Consistency
> -----------
>
> Consistency is good, ideally we wouldn't use a different format for each thing.
> Use-cases should be quite divergent before you consider adding another format
> to worry about. This is a point against JSON for artifact metadata.

Let's elaborate a bit on what "Consistency" means for us...

Artifact format (i.e. the layout and content of an artifact) is not
stable now, nor will it be for 1.0, but some decent arguments have been
made for making the artifact format stable; this might even happen as
early as 1.2 (the next stable BuildStream).

This means that whatever decisions we make now regarding artifact
format will follow us *forever* - and this inconsistency will become
inconvenient to users.

We already have some outputs in YAML format intended for machine
readability, e.g. `bst workspace list` outputs YAML because it was
considered that some projects may want to automate workspace creation,
introspection and deletion for CI purposes - similarly, `bst show`
outputs YAML for some things, and this is also a focal point for
scripting and automation around BuildStream.

If we introduce JSON in a part of our public API surface (which
artifact metadata will inevitably become if we make artifact format
stable and expose it to external tooling), that means that projects
which consume BuildStream will have to use multiple parsers and
understand multiple formats to interact with BuildStream and script
around it, forever.

This significantly detracts from the simplicity / ease of use of
BuildStream as a whole, and as such it is a highly undesirable outcome.
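
To make the scripting point above concrete: today a project can drive
all of this with a single parser, roughly along these lines (illustrative
only, the exact output shape of `bst workspace list` is not the point
here); adding JSON for artifact metadata means every such script grows a
second format to deal with.

  import subprocess
  from ruamel.yaml import YAML

  yaml = YAML(typ="safe")

  # The same parser already used for .bst files and project.conf can
  # consume the frontend's machine readable output directly.
  output = subprocess.check_output(["bst", "workspace", "list"])
  workspaces = yaml.load(output.decode("utf-8"))
  print(workspaces)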


> Performance
> -----------
>
> This is the main motivation for this proposal, with the idea that we will
> need to work with platforms that have on the order of 100,000 files.

As I understand it, your primary motivation is exactly this, 100,000
file manifests.

Your case is that for a single artifact with 100,000 files, parsing
YAML is more costly than parsing JSON; this by itself is a strong
point.

Since we are talking about an API surface which will become stable, I
think we have to ask ourselves:

  "Is loading 100,000 file manifests slower to parse with existing YAML
   parsers than with existing JSON parsers ? Or is YAML really
   inherently that much more difficult to parse over JSON ?

   Is this something very silly, like for instance are we comparing
   a python parser implementation with a python binding to an
   underlying, optimized json-c library ?"

I believe we would be doing ourselves a great disservice by making a
decision as permanent and undesirable as this, just because it would
cost a bit more up front to get load speeds on par "right now". Doing
things "nicely" is "hard", yes - but is it so "hard" that we have to
sacrifice "nice", permanently ?


There are a lot of other things to consider here:

 o I highly doubt that the order-of-magnitude difference between YAML
   and JSON load times grows linearly - was this taken into
   consideration, or measured, even ? (see the sketch after this list)

   My gut tells me that the load time of a single list element increases
   with the size of your data set - which might only be justified in
   cases where the "&" and "*" operators actually start appearing in the
   parsed YAML stream - this would indicate that it is possible to
   achieve load times similar to JSON, at least in the cases where the
   referencing "&" and "*" do not appear.

 o How does this really impact the BuildStream session ?

   - Which kinds of BuildStream sessions necessitate loading the
     metadata, and if they do need to load the metadata, do they also
     have to load the manifest ?

     Looks like we might need this mostly for `bst build` and
     for `bst checkout`; most other commands don't need this operation.

   - In the case of `bst build` - my gut is telling me that we're
     talking about a manifest that we can load once, after which point
     it will already be on the heap.

      Realistically, you might have a couple of mega huge artifacts to
      stage in the course of a build; they need to be staged hundreds
      of times but only need to be parsed once.

     So, are we optimizing a build that takes around 4 hours to
     complete, by a couple of minutes ?

     Would shaving off a couple of minutes from a 4 hour build really
     be worth this inconsistency ?

   Before coming to drastic conclusions here, we should be analyzing
   the data and running benchmarks on the BuildStream frontend against
   simulated data sets which resemble real world use cases.

 o You have also left out the part where, apparently, performance is
   quite decent with the newer YAML interfaces from ruamel.yaml, so
   long as we stay away from loading dictionaries:

     https://gitlab.com/BuildStream/buildstream/issues/82#note_44846694

   Which means to me that denormalization of the metadata, at least of
   the manifest, is an option that we can consider before resorting to
   introducing a second data format into a potentially public API.

   If we had e.g. manifest.yaml separate from manifest_attrs.yaml, we
   could get away with avoiding dictionaries for one (which is probably
   still a cheap workaround to a problem that is better addressed
   in the ruamel.yaml library) - but we also gain the possibility of
   lazy loading.

   Maybe we only need to know the full list at times, maybe we only
   need to know the attributes of things at other times.

 o We currently don't even have a manifest in play, which just
   underlines how premature this proposed change is.
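
As promised above, here is a minimal sketch of the kind of measurement I
mean - it loads a flat list of scalars and a list of one-entry dicts at
growing sizes, so we can see whether the per-element load time is
actually constant and what the dictionaries alone cost us (again
assuming the newer ruamel.yaml safe API):

  import io
  import time
  from ruamel.yaml import YAML

  yaml = YAML(typ="safe")

  def load_time(data):
      # Dump once up front, measure only the load
      buf = io.StringIO()
      yaml.dump(data, buf)
      start = time.time()
      yaml.load(buf.getvalue())
      return time.time() - start

  for n in (1000, 10000, 100000):
      flat = [str(x) for x in range(n)]                   # denormalized manifest
      nested = [{"filename": str(x)} for x in range(n)]   # current shape
      print(n, load_time(flat) / n, load_time(nested) / n)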


> buildstream._yaml.load() is significantly slower than json.load():
>
> # json saving
> time python3 -c 'data = {"files": [{"filename": str(x)} for x in
> range(100000)]}; import json; f = open("data.json", "w");
> json.dump(data, f); f.close();'
> # real    0m0.463s
> # user    0m0.440s
> # sys     0m0.010s
>
> # json loading
> time python3 -c 'import json; f = open("data.json");
> loaded = json.load(f); f.close(); print("\n".join(x["filename"] for x in
> loaded["files"]));' > data.txt
> # real    0m0.094s
> # user    0m0.080s
> # sys     0m0.010s
>
> # _yaml saving
> time python3 -c 'data = {"files": [{"filename": str(x)} for x in
> range(100000)]}; import buildstream; buildstream._yaml.dump(data,
> "data.yaml");'
> # real    0m15.008s
> # user    0m14.730s
> # sys     0m0.160s
>
> # _yaml loading
> time python3 -c 'import buildstream;
> loaded = buildstream._yaml.load("data.yaml"); print("\n".join(x["filename"] for
> x in loaded["files"]));' > data.txt
> # real    2m58.688s
> # user    2m58.400s
> # sys     0m0.260s
>
> As discussed on issue 82 [8], there are faster loading options available but we
> would have to use an interface that's not considered stable yet. Here are the
> timings for that:
>
> # YAML(typ="safe") saving
> time python3 -c 'data = {"files": [{"filename": str(x)} for x in
> range(100000)]}; from ruamel.yaml import YAML;
> yaml = YAML(typ="safe"); f = open("data.yaml", "w"); yaml.dump(data, f);
> f.close();'
> # real    0m2.553s
> # user    0m2.480s
> # sys     0m0.060s
>
> # YAML(typ="safe") loading
> time python3 -c 'from ruamel.yaml import YAML;
> yaml = YAML(typ="safe"); f = open("data.yaml"); loaded = yaml.load(f);
> f.close(); print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
> # real    0m2.787s
> # user    0m2.610s
> # sys     0m0.170s
>
> Unfortunately that's still relatively slow.

> Security
> --------
>
> Artifact metadata is intended to be shared between machines, using 'bst pull'
> and friends. This means users may share arbitrary data with each other, which
> is unlikely to be scrutinized. Code will execute against the data outside the
> sandbox. This seems a possible target for exploits.

The moment that you cannot trust the provenance of the artifact you
have received, you already have a problem.

Once you can trust the provenance of the artifact you are receiving,
all of your security concerns become entirely moot.

Yes, the current state of things is fragile; we rely on setups where
we download artifacts over https and only allow uploads from authorized
builders over ssh. A proper solution will be to add GPG signing as a
first class citizen of BuildStream, and to ensure this trust from end
to end.
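
To be clear about what "first class" would mean, the shape of it is
roughly the following (a hypothetical helper, not an existing BuildStream
API): verify a detached signature over the artifact payload against a
keyring of trusted builders before we ever parse any of its metadata -
at which point the metadata format has no security relevance at all.

  import subprocess

  def verify_artifact(artifact_payload, signature, trusted_keyring):
      # Hypothetical: refuse to even parse artifact metadata unless the
      # detached signature verifies against the trusted builder keyring.
      subprocess.check_call([
          "gpg", "--no-default-keyring",
          "--keyring", trusted_keyring,
          "--verify", signature, artifact_payload,
      ])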

Cheers,
    -Tristan


