Proposal: JSON for artifact metadata instead of YAML
- From: Angelos Evripiotis <angelos evripiotis gmail com>
- To: buildstream-list gnome org
- Subject: Proposal: JSON for artifact metadata instead of YAML
- Date: Tue, 31 Oct 2017 12:39:26 +0000
Hi!
I'd like to propose that we use the Python library json module for saving and
loading artifact metadata, instead of BuildStream's _yaml module.
I think we should do this mostly for performance and a bit for security.
Here 'artifact metadata' means the data that is part of a build artifact, and
stored in the 'meta' directory. e.g. all these files:
~/.cache/buildstream/artifacts/extract/project/element/22ba6/meta/*.yaml
I think _yaml is very good for configuration, i.e. .bst files and project.conf,
and am not suggesting that we change that.
As far as I can tell, JSON supports the current and proposed uses for artifact
metadata. If adopted, I imagine we'd have some period of backwards
compatibility before dropping support for the old .yaml metadata files (pre
v1). I also volunteer to undertake the work to do it before v1 (this week?).
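For the backwards-compatibility period, the loading side could be as simple as
preferring the new JSON metadata and falling back to the old YAML file when it
isn't there. A rough sketch — the helper name and fallback layout here are
illustrative, not existing BuildStream code:

```python
import json
import os

def load_artifact_metadata(meta_dir, name):
    # Hypothetical helper: prefer the new .json metadata, and fall
    # back to the pre-v1 .yaml file if the artifact predates the change.
    json_path = os.path.join(meta_dir, name + ".json")
    if os.path.exists(json_path):
        with open(json_path) as f:
            return json.load(f)
    # Old artifact: defer to BuildStream's existing YAML loader.
    import buildstream
    return buildstream._yaml.load(os.path.join(meta_dir, name + ".yaml"))
```

New artifacts would only ever hit the fast JSON path, so the YAML fallback
cost is paid solely for old artifacts until support is dropped.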
Here are the considerations I made.
Consistency
-----------
Consistency is good, ideally we wouldn't use a different format for each thing.
Use-cases should be quite divergent before you consider adding another format
to worry about. This is a point against JSON for artifact metadata.
Performance
-----------
This is the main motivation for this proposal, with the idea that we will need
to work with platforms that have on the order of 100,000 files.
buildstream._yaml.load() is significantly slower than json.load():
# json saving
time python3 -c 'data = {"files": [{"filename": str(x)} for x in range(100000)]}; import json; f = open("data.json", "w"); json.dump(data, f); f.close();'
# real 0m0.463s
# user 0m0.440s
# sys 0m0.010s

# json loading
time python3 -c 'import json; f = open("data.json"); loaded = json.load(f); f.close(); print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
# real 0m0.094s
# user 0m0.080s
# sys 0m0.010s
# _yaml saving
time python3 -c 'data = {"files": [{"filename": str(x)} for x in range(100000)]}; import buildstream; buildstream._yaml.dump(data, "data.yaml");'
# real 0m15.008s
# user 0m14.730s
# sys 0m0.160s

# _yaml loading
time python3 -c 'import buildstream; loaded = buildstream._yaml.load("data.yaml"); print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
# real 2m58.688s
# user 2m58.400s
# sys 0m0.260s
As discussed on issue 82 [8], there are faster loading options available, but
we would have to use an interface that's not considered stable yet. Here are
the times for that:
# YAML(typ="safe") saving
time python3 -c 'data = {"files": [{"filename": str(x)} for x in range(100000)]}; from ruamel.yaml import YAML; yaml = YAML(typ="safe"); f = open("data.yaml", "w"); yaml.dump(data, f); f.close();'
# real 0m2.553s
# user 0m2.480s
# sys 0m0.060s

# YAML(typ="safe") loading
time python3 -c 'from ruamel.yaml import YAML; yaml = YAML(typ="safe"); f = open("data.yaml"); loaded = yaml.load(f); f.close(); print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
# real 0m2.787s
# user 0m2.610s
# sys 0m0.170s
Unfortunately that's still relatively slow.
Security
--------
Artifact metadata is intended to be shared between machines, using 'bst pull'
and friends. This means users may share arbitrary data with each other, which
is unlikely to be scrutinized. Code will execute against that data outside the
sandbox, which makes it a possible target for exploits.
YAML is smart, JSON is simple. Full YAML in general poses some security
concerns because it allows for the instantiation of arbitrary objects. Loading
without restrictions has led to a number of CVEs [1], [2], [3], [4], [5]. The
RoundTripLoader used by _yaml seems to be of the restricted variety, which is
good.
Python's json is small [6], _yaml is big. Round-trip parsing adds complexity
to the parsing problem, so ruamel [7] and _yaml together are necessarily quite
large, which seems to offer a bigger target.
I'm guessing that Python's json parser has had more scrutiny than _yaml and
its dependencies have had, given its smaller size and more mainstream position.
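To illustrate the difference in attack surface: json.load() can only ever
produce plain Python values, while an unrestricted YAML loader interprets tags
as instructions to construct arbitrary Python objects. A small sketch (the
hostile document below is just an illustration of the well-known
python/object/apply tag):

```python
import json

# json.loads only ever yields dict, list, str, int, float, bool and
# None, so there is nothing for a hostile document to instantiate.
data = json.loads('{"files": [{"filename": "0"}]}')
assert isinstance(data["files"][0], dict)

# By contrast, a full (non-safe) YAML load treats tags like this one
# as object construction, here reaching os.system:
hostile = "!!python/object/apply:os.system ['echo pwned']"
# yaml.load(hostile, Loader=yaml.Loader)  # would execute the command
```

A restricted loader (safe_load, or ruamel's RoundTripLoader) refuses such
tags, which is why the restriction matters so much for shared metadata.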
References
----------
[1]: https://nvd.nist.gov/vuln/detail/CVE-2013-0156 (Ruby on rails)
[2]: https://nvd.nist.gov/vuln/detail/CVE-2016-4972 (OpenStack)
[3]: https://nvd.nist.gov/vuln/detail/CVE-2017-2292 (MCollective)
[4]: https://nvd.nist.gov/vuln/detail/CVE-2017-2295 (Puppet)
[5]: https://nvd.nist.gov/vuln/detail/CVE-2017-2810 (TabLib)
[6]: https://github.com/python/cpython/blob/3.6/Modules/_json.c
[7]: https://bitbucket.org/ruamel/yaml/src/
[8]: https://gitlab.com/BuildStream/buildstream/issues/82
P.S. I also tried with pyyaml out of curiosity:

time python3 -c 'import yaml; f = open("data.yaml"); loaded = yaml.load(f); f.close(); print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
# real 0m18.477s
# user 0m18.240s
# sys 0m0.220s

# Oops, remember that 'safe_load()' is what we're really interested in ;)
time python3 -c 'import yaml; f = open("data.yaml"); loaded = yaml.safe_load(f); f.close(); print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
# real 0m18.385s
# user 0m18.220s
# sys 0m0.160s