Hi,
Didn't forget this, but was off last week...
On Fri, 2018-09-21 at 19:11 +0100, Sander Striker wrote:
> Hi,
>
> Reply to self, to try and clarify some things.
>
> > On Mon, Sep 17, 2018 at 11:37 PM Sander Striker <s striker striker nl> wrote:
> > Hi,
> >
> >
> > On Sun, Sep 16, 2018 at 9:59 AM Tristan Van Berkom <tristan vanberkom codethink co uk> wrote:
> > > Hi Sander,
> > >
> > > This one seems to keep falling through the cracks, picking it back up ;-)
> >
> > It is one of those threads :)
> >
> > > On Wed, 2018-09-12 at 15:57 +0100, Sander Striker wrote:
> > > > Hi,
> > > >
> > > > On Tue, Sep 11, 2018 at 12:26 PM Tristan Van Berkom <tristan vanberkom codethink co uk> wrote:
> > > > > On Mon, 2018-09-10 at 17:01 +0200, Sander Striker wrote
> > > >
> > > > [...]
> > > > > > I am assuming the $(cache_key) is actually unique enough on its own?
> > > > > > And this name is purely symbolic, contextualized for a project?
> > > > > > Nothing would require this under the hood? As in, I wouldn't expect
> > > > > > a remote artifact cache to carry the $(project-name)/$(element-name)
> > > > > > as part of its keys(?).
> > > > >
> > > > > Since the beginning, the refs we use to store / retrieve an artifact
> > > > > have always been namespaced as:
> > > > >
> > > > > ${project-name}/${element-name}/${cache-key}
> > > > >
> > > > > Whether to change that is an interesting discussion indeed.
> > > >
> > > > I'm merely thinking about potential reuse; and whether the
> > > > namespacing is causing an artificial reason why cache results are not
> > > > shareable.
> > >
> > > Not an artificial reason really no.
> > >
> > > Consider two elements from different projects, which purport to have
> > > the same inputs, these will produce the same `${cache-key}` but we
> > > cannot go so far as to say that the output for these two inputs can be
> > > the same.
> >
> > That would be unfortunate. So if you include e.g. webkit in two
> > projects that share an ArtifactCache, you'll be building it twice?
> >
> > > That would only be a good assumption for the `files/` subdirectory of
> > > the artifact, i.e. the output that a build actually generated.
> >
> > Right... I think we're getting to the core of the issue :). This
> > part of the artifact is the expensive part to produce. And any
> > opportunity we have not to redo work here, we should take IMO.
> >
> > > Depending on what kind of provenance information we encode in files in
> > > the `meta/` directory, and what kind of information is included in the
> > > build logs, this will be incorrect.
> >
> > Sure.
> >
> > > If for instance, we want to encode information about from which project
> > > the artifact came from in it's metadata; then reusing an artifact
> > > generated by one project in another project will certainly be
> > > incorrect; even if the `files/` output stored in that artifact are the
> > > same.
> >
> > It makes me think that we may need an indirect mapping here. One
> > where we can share the binaries between the Artifacts and maybe
> > fill out the "information about from which project
> > the artifact came from in it's metadata" when we build in the
> > context of said project.
> >
> > I'm expecting that, when using remote execution, we actually reuse
> > much of the builds, due to the fact that Actions care about the
> > source input tree, and not where the source input tree came from.
> > That concept could maybe be extended to an implementation for local
> > builds as well?
This sounds like it makes sense yes, but let's be careful not to cross
subject material and confuse folks.
Fair point. I'm actually trying to think through the impact of $project/$element in the $cachekey :). It lead us here. And helped me come to the conclusion that $project/$element in the $cachekey are not prohibiting build avoidance, assuming we do aliasing as you've called it below.
I think what you are saying is that, once we have determined that a
given build has identical inputs by observing the actual downloaded
data, we could reuse a previously created output for the artifact.
This is a matter of build avoidance and can only be determined at build
time.
Correct.
The other subject out there is that the project itself is the
identifier of the aliases used in sources, as such removing the project
from the cache key seems to me problematic (we still want the cache
keys to have the origins of the data in context, before even
downloading any sources).
Agreed.
> > > There is another angle to think about here too, which is the aliases
> > > which belong to the project.conf which produced an element: If the
> > > artifact is not produced within at least it's project's namespace; then
> > > we don't have any assurances at all that the aliases used to produce
> > > the final URLs used in Sources, are in fact the URLs we expect.
> >
> > Looking
> > at https://gitlab.com/BuildStream/buildstream/blob/master/buildstre
> > am/element.py#L2015 there is a whole lot that goes into the
> > cachekey. Definitely more than I expected. If the ArtifactCache
> > was meant as an interface with different implementations, why is
> > the name of the implementation included in the cache key?
This was introduced in commit 620006f2c3e6271e56743edc129c4fb19eab241a,
I vaguely recall discussing this on IRC, and think it may have been due
to the two existing artifact cache implementations at the time did not
support the same things (for example, the tar cache will remember which
files are hardlinked together when you intentionally created a hardlink
in your artifact, while the OSTree cache loses this information).
If we know that the outputs may differ depending on the artifact
storage medium, it will have to affect the cache key.
At this point, since we've pretty much swallowed up the artifact cache
and have opted to have an in house one which we should make support all
platforms we support, we can remove this in master.
(That said, we must still remember to bump the core artifact version
when fixing regressions due to migration to CAS, such as all the
missing permission bits, or whenever enhancing the CAS to handle file
attributes better).
> > > Interestingly this again touches on issue 569 which I raised but is not
> > > really related: By not binding the outputs of a project into a
> > > namespace, we are weakening the statement that `refs` can in fact be
> > > unique for the URLs we declare them for (because now the URLs fall out
> > > of the equation, as those are only defined by the project.conf).
> >
> > I see your point.
> >
> > So if we can have cache key map to a source key (the root of a
> > merkle tree representing the staged sources), then you can create a
> > cache key function that substitutes all of the sources with the
> > source key, and it will be correct without the element and project
> > names.
> >
>
> We could leverage SourceCache (#440) for this, which is essentially
> the mapping of $cachekey to Directory digest. Where the directory is
> representing the staged sources for the element.
>
> That gives us a second chance to consider whether there is anything to build. In short:
> 1) verify if $cachekey is in ArtifactCache, if not then
> 2) verify if $cachekey is in SourceCache, if
> a) yes, then calculate a new $cachekey_s that uses the directory digest instead of source alias/ref, and again check if $cachekey_s is in ArtifactCache
> b) no, then stage the sources, and calculate the digest of the staged sources. Then calculate a new $cachekey_s and check if $cachekey_s is in ArtifactCache
>
> If we do encounter a hit on $cachekey_s, we now have an oportunity to
> reuse the actual built artifacts, and we can create a new artifact
> with appropiate metadata (which may include the fact of reuse).
>
> Only when we do not encounter either $cachekey or $cachekey_s should
> we build anything. And then obviously put entries under both
> cachekeys on completion.
>
> I've left out how the SourceCache gets populated, because I think
> we've already had previous discussions on that.
>
> Now, that said, with the above in mind, it clearly doesn't matter if
> $cachekey contains project and element, because there is a second
> line in place for build avoidance. I think that is a good trade off.
>
> This doesn't make refactoring or renaming a project prohibitively
> expensive. And it also does the right thing (read: dpesn't waste
> time and resources) when people copy a .bst (or part of a project)
> rather than using a junction.
You are talking about aliasing keys for things which we found already
existed, which is not a bad idea, but opens up a separate can of worms,
which is how we trace the provenance of an aliased artifact, and how we
bake up metadata for it.
Add to that can of worms: We also need to take care of aliasing the
reverse dependencies instead of rebuilding everything, when a low level
element's cache key differs but is aliasable.
* I very much value the ability to rename your elements without it
causing any changes in cache key.
If it could introduce a new cache key that is aliasable, that is
almost the same thing, so I like it.
* Aliasing artifacts in the case that a Source configuration has
changed but resulted in staging identical inputs, is valuable too.
* I think that the chances of having two projects which happen to
build an element with the same source code at the same ref,
completely identical inputs and an identical dependency chain - are
astronomically small. Even more so if those builds are occurring on
the same infra.
I can think of some things people could do:
- Copy projectA. Rename copy to projectB. Build.
- Copy projectA. Rename copy to projecC. Prune elements. Add new element.
- Create project D. Copy element W from projectA (and its aliases).
The last one could be a "webkit" like element. We can't assume that everyone is going to junction instead of copy the .bst. The latter is easy to understand and quickly do.
If this were the sole purpose, then artifact aliasing would be far
to much trouble for the near zero benefits.
I think we need to be a bit careful with such statements, they sound as if based on data, where in reality they're based on assumptions. As our assumptions are not always aligned, I think that's where we then start.
That said, in this particular instance we can ignore our difference based on the agreement on the first two use cases.
Overall I do like your idea of aliasing cache keys for build avoidance
where possible.
Cool. I think we can park the aliasing part of the discussion and resume the artifact command subgroup part of the thread. I felt we were getting close to consensus on that as well?
Cheers,
-Tristan
Cheers,
Sander