Proposal: JSON for artifact metadata instead of YAML
- From: Angelos Evripiotis <angelos evripiotis gmail com>
- To: buildstream-list gnome org
- Subject: Proposal: JSON for artifact metadata instead of YAML
- Date: Tue, 31 Oct 2017 12:39:26 +0000
Hi!
I'd like to propose that we use the Python library json module for saving and
loading artifact metadata, instead of BuildStream's _yaml module.
I think we should do this mostly for performance and a bit for security.
Here 'artifact metadata' means the data that is part of a build artifact, and
stored in the 'meta' directory. e.g. all these files:
~/.cache/buildstream/artifacts/extract/project/element/22ba6/meta/*.yaml
I think _yaml is very good for configuration, i.e. .bst files and project.conf,
and am not suggesting that we change that.
As far as I can tell, JSON supports the current and proposed uses for artifact
metadata. If adopted, I imagine we'd have some period of backwards
compatibility before dropping support for the old .yaml metadata files (pre
v1). I also volunteer to undertake the work to do it before v1 (this week?).
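For the backwards-compatibility period, the loading side could be as simple as
preferring the new JSON metadata and falling back to the old YAML file when it
isn't there. A rough sketch — the helper name and fallback layout here are
illustrative, not existing BuildStream code:

```python
import json
import os

def load_artifact_metadata(meta_dir, name):
    # Hypothetical helper: prefer the new .json metadata, and fall
    # back to the pre-v1 .yaml file if the artifact predates the change.
    json_path = os.path.join(meta_dir, name + ".json")
    if os.path.exists(json_path):
        with open(json_path) as f:
            return json.load(f)
    # Old artifact: defer to BuildStream's existing YAML loader.
    import buildstream
    return buildstream._yaml.load(os.path.join(meta_dir, name + ".yaml"))
```

New artifacts would only ever hit the fast JSON path, so the YAML fallback
cost is paid solely for old artifacts until support is dropped.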
Here are the considerations I made.
Consistency
-----------
Consistency is good, ideally we wouldn't use a different format for each thing.
Use-cases should be quite divergent before you consider adding another format
to worry about. This is a point against JSON for artifact metadata.
Performance
-----------
This is the main motivation for this proposal, with the idea that we will need
to work with platforms that have on the order of 100,000 files.
buildstream._yaml.load() is significantly slower than json.load():
# json saving
time python3 -c 'data = {"files": [{"filename": str(x)} for x in range(100000)]}; import json; f = open("data.json", "w"); json.dump(data, f); f.close();'
# real 0m0.463s
# user 0m0.440s
# sys 0m0.010s

# json loading
time python3 -c 'import json; f = open("data.json"); loaded = json.load(f); f.close(); print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
# real 0m0.094s
# user 0m0.080s
# sys 0m0.010s
# _yaml saving
time python3 -c 'data = {"files": [{"filename": str(x)} for x in range(100000)]}; import buildstream; buildstream._yaml.dump(data, "data.yaml");'
# real 0m15.008s
# user 0m14.730s
# sys 0m0.160s

# _yaml loading
time python3 -c 'import buildstream; loaded = buildstream._yaml.load("data.yaml"); print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
# real 2m58.688s
# user 2m58.400s
# sys 0m0.260s
As discussed on issue 82 [8], there are faster loading options available, but
we would have to use an interface that's not considered stable yet. Here are
the times for that:
# YAML(typ="safe") saving
time python3 -c 'data = {"files": [{"filename": str(x)} for x in range(100000)]}; from ruamel.yaml import YAML; yaml = YAML(typ="safe"); f = open("data.yaml", "w"); yaml.dump(data, f); f.close();'
# real 0m2.553s
# user 0m2.480s
# sys 0m0.060s

# YAML(typ="safe") loading
time python3 -c 'from ruamel.yaml import YAML; yaml = YAML(typ="safe"); f = open("data.yaml"); loaded = yaml.load(f); f.close(); print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
# real 0m2.787s
# user 0m2.610s
# sys 0m0.170s
Unfortunately that's still relatively slow.
Security
--------
Artifact metadata is intended to be shared between machines, using 'bst pull'
and friends. This means users may share arbitrary data with each other, which
is unlikely to be scrutinized. Code will execute against that data outside the
sandbox, which makes it a possible target for exploits.
YAML is smart, JSON is simple. Full YAML in general poses some security
concerns because it allows for the instantiation of arbitrary objects. Loading
without restrictions has led to a number of CVEs [1], [2], [3], [4], [5]. The
RoundTripLoader used by _yaml seems to be of the restricted variety, which is
good.
Python's json is small [6], _yaml is big. Round-trip parsing adds complexity
to the parsing problem, so ruamel [7] and _yaml together are necessarily quite
large, which seems to offer a bigger target.
I'm guessing that Python's json parser has had more scrutiny than _yaml and
its dependencies have had, given its smaller size and more mainstream position.
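To illustrate the difference in attack surface: json.load() can only ever
produce plain Python values, while an unrestricted YAML loader interprets tags
as instructions to construct arbitrary Python objects. A small sketch (the
hostile document below is just an illustration of the well-known
python/object/apply tag):

```python
import json

# json.loads only ever yields dict, list, str, int, float, bool and
# None, so there is nothing for a hostile document to instantiate.
data = json.loads('{"files": [{"filename": "0"}]}')
assert isinstance(data["files"][0], dict)

# By contrast, a full (non-safe) YAML load treats tags like this one
# as object construction, here reaching os.system:
hostile = "!!python/object/apply:os.system ['echo pwned']"
# yaml.load(hostile, Loader=yaml.Loader)  # would execute the command
```

A restricted loader (safe_load, or ruamel's RoundTripLoader) refuses such
tags, which is why the restriction matters so much for shared metadata.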
References
----------
[1]: https://nvd.nist.gov/vuln/detail/CVE-2013-0156 (Ruby on rails)
[2]: https://nvd.nist.gov/vuln/detail/CVE-2016-4972 (OpenStack)
[3]: https://nvd.nist.gov/vuln/detail/CVE-2017-2292 (MCollective)
[4]: https://nvd.nist.gov/vuln/detail/CVE-2017-2295 (Puppet)
[5]: https://nvd.nist.gov/vuln/detail/CVE-2017-2810 (TabLib)
[6]: https://github.com/python/cpython/blob/3.6/Modules/_json.c
[7]: https://bitbucket.org/ruamel/yaml/src/
[8]: https://gitlab.com/BuildStream/buildstream/issues/82
P.S. I also tried with pyyaml out of curiosity:

time python3 -c 'import yaml; f = open("data.yaml"); loaded = yaml.load(f); f.close(); print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
# real 0m18.477s
# user 0m18.240s
# sys 0m0.220s

# Oops, remember that 'safe_load()' is what we're really interested in ;)
time python3 -c 'import yaml; f = open("data.yaml"); loaded = yaml.safe_load(f); f.close(); print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
# real 0m18.385s
# user 0m18.220s
# sys 0m0.160s