[BuildStream] Cache key instability
- From: Daniel Silverstone <daniel silverstone codethink co uk>
- To: BuildStream <buildstream-list gnome org>
- Subject: [BuildStream] Cache key instability
- Date: Thu, 14 Feb 2019 11:09:21 +0000
Hi all,
There is an issue with instability in the way we currently generate cache
keys. It's possible that some of you have hit this in the past, but I
think it's something we need to resolve pretty much now.
Currently we generate our cache keys thusly:
    def generate_key(value):
        ordered = _yaml.node_sanitize(value)
        string = pickle.dumps(ordered)
        return hashlib.sha256(string).hexdigest()
As you can see, we rely on the pickle module to generate the byte stream
which is hashed to form the cache key. Sadly this is not stable. For
example, I am working on a replacement for the way Variables are expanded
and in doing that, I ended up with a number of interned strings to reduce
the memory impact of the rework. This meant that despite the *value* of
the `ordered` cache key dictionary not changing, the pickled `identity`
of it did change and thus the cache keys changed. This was an unexpected
and unwanted side-effect of the work I was doing.
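To illustrate the effect described above, here is a minimal sketch, assuming CPython and the default pickle protocol. pickle memoises objects by identity, so a string object that appears twice is written once and then referenced, whereas two equal but distinct strings are each written in full. The names `make_string`, `plain`, `interned` and `pickle_key` are made up for the illustration and are not BuildStream code.

```python
import hashlib
import pickle
import sys


def make_string():
    # Built at runtime, so each call returns a fresh, non-interned str object.
    return "".join(["cache", "-", "key"])


def pickle_key(value):
    # Same shape as generate_key() above, minus the BuildStream-specific
    # _yaml.node_sanitize() step.
    return hashlib.sha256(pickle.dumps(value)).hexdigest()


# Two dictionaries with identical values but different object sharing.
plain = {"a": make_string(), "b": make_string()}
interned = {"a": sys.intern(make_string()), "b": sys.intern(make_string())}

assert plain == interned               # equal by value
assert plain["a"] is not plain["b"]    # two separate string objects
assert interned["a"] is interned["b"]  # one shared string object

# On CPython the two byte streams differ, and so do the resulting keys.
print(pickle_key(plain) == pickle_key(interned))
```

The values compare equal, yet the hashes are expected to differ, which is exactly the kind of change that interning strings can cause.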
I have spoken with Benjamin Schubert about this in the past and we decided
that it would make sense not to use pickle here anyway, because the pickle
byte stream is not necessarily stable from one Python release to the next.
(While newer pickle modules can decode older pickles, they might introduce
changes which mean that the encoding differs in new Python releases.)
We discussed and decided that a lightweight replacement for the pickling
might be to use JSON. There are any number of JSON implementations which
might be selected, some may be faster than pickle, some may be slower, but
importantly, all ought to be stable in terms of how the data structure is
encoded into JSON.
I propose that we rev the cache key version, and switch from pickling to JSON
(though I'd entertain a ruamel dump too if that's quick enough) and am seeking
support for this. Indeed I intend to file an MR when I have written and tested
a change to use the built-in JSON implementation, provided that doing so does
not have a *significant* short-term performance impact.
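As a rough sketch of what the proposal could look like with the standard library `json` module (not the eventual MR): the function name `generate_key_json` is hypothetical, the `_yaml.node_sanitize()` step is left out so the snippet runs on its own, and the `sort_keys` and `separators` settings are my assumption about how to pin the encoding down.

```python
import hashlib
import json


def generate_key_json(value):
    # Hypothetical replacement for the pickle-based generate_key().
    # sort_keys=True makes the output independent of dict insertion order;
    # fixed separators avoid depending on json's default whitespace.
    encoded = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = {"name": "element", "deps": ["base", "tools"]}
    second = {"deps": ["base", "tools"], "name": "element"}
    # Equal values give equal keys, regardless of ordering or object identity.
    print(generate_key_json(first) == generate_key_json(second))
```

Because JSON encodes values rather than object identity, interned and non-interned strings produce the same key here. The catch is that `json.dumps` only accepts JSON-compatible types, so anything else in the dictionary would need converting first.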
Regards,
Daniel.
--
Daniel Silverstone https://www.codethink.co.uk/
Solutions Architect GPG 4096/R Key Id: 3CCE BABE 206C 3B69