On Wed, 2018-05-30 at 11:00 +0200, Sander Striker wrote:
> Hi,
>
> This post is not really commenting on the approach we're taking to
> implement remote execution, but rather what it enables us to do in
> terms of source distribution.
This sounds attractive; also note there has been a writeup at:
Should we use a CAS-based source cache as a mirror?
https://gitlab.com/BuildStream/buildstream/issues/418
Good question.
[...]
> > Staging sources into CAS is problematic if e.g. the whole .git
> > directory is included. We should avoid this where possible. This is
> > already a concern for caching of build trees (#21) as well and can be
> > improved on independently of any other steps.
>
> If I understand this correctly, we have an opportunity for simplified
> source "mirroring" here.
>
> If we introduce a SourceCache, which maps a sourcekey to a CAS
> directory node, the fetch operation becomes:
> 1) lookup SourceKey in SourceCache
> 1a.1) when an entry is present, fetch the Directory nodes from CAS
> and store them in local CAS.
> 1b.1) when no entry is found, fetch the source in the traditional
> sense
> 1b.2) stage the source in a temporary location (assumes #376 is
> resolved)
> 1b.3) put the staged source into the local CAS
> iff you have write permissions to SourceCache:
> 1b.4) upload the staged source to CAS
> 1b.5) put an entry into SourceCache
>
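To check my reading of the steps above, here is a minimal sketch in
Python. Everything in it is hypothetical: the CAS class, the fetch()
signature and the fetch_and_stage callback are made up for
illustration, the SourceCache is a plain dict, and the staged source is
a single blob rather than a tree of Directory nodes.

```python
import hashlib


class CAS:
    """Toy content-addressable store mapping digest -> bytes (hypothetical)."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]


def fetch(source_key, source_cache, remote_cas, local_cas,
          fetch_and_stage, can_push=False):
    """Make the staged source available in local_cas.

    Returns (digest, origin), where origin is "cache" or "upstream".
    """
    digest = source_cache.get(source_key)      # 1) lookup in SourceCache
    if digest is not None:
        # 1a) entry present: copy the content from remote CAS to local CAS
        local_cas.put(remote_cas.get(digest))
        return digest, "cache"

    # 1b) no entry: traditional fetch, then stage in a temporary location
    staged = fetch_and_stage(source_key)
    digest = local_cas.put(staged)             # staged source into local CAS
    if can_push:
        # Only with write permission: upload and record the entry
        remote_cas.put(staged)
        source_cache[source_key] = digest
    return digest, "upstream"
```

With this shape, the first instance with write permission pays for the
traditional fetch, and every later instance only talks to SourceCache
and CAS.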
> This does optimize for remote execution, in the sense that actual
> fetching of the source as well as fetching the files that make up the
> source is avoided.
> I could envision a configuration option, allowing the user to say:
> - I am always building locally, make sure that my local CAS has
> everything I need to build locally.
> - I want the actual sources locally, make sure to always fetch in the
> traditional sense
>
> In terms of source mirroring, this could now be a central instance
> of bst that is just running bst fetch. It will have all of the
> fetched sources locally, in case they need to be inspected. All
> other instances will pull from SourceCache as CAS.
Not entirely. I think you want the sources to also have been *staged*
in the way they would be used by their elements, so that the staged
results are committed to CAS, rather than an entire git repo or a
still packed up tarball (at least, this is how I'm reading the
intentions).
We'd probably want this mirrored blob to be addressable by the element
and its source configurations; i.e. the blob is one build directory
after all "Source" objects have unpacked what they want.
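As a rough illustration of "addressable by the element and its source
configurations", one could derive the key as below. This is only a
sketch: staged_source_key() and the config fields (kind, url, ref) are
made up for the example, and whether the element name belongs in the
key at all is an open assumption.

```python
import hashlib
import json


def staged_source_key(element_name, source_configs):
    """Hypothetical key for the staged build directory of one element.

    source_configs is the ordered list of per-source configuration
    dicts; the order is kept because sources are staged in sequence.
    """
    payload = json.dumps(
        {"element": element_name, "sources": source_configs},
        sort_keys=True,            # dict key order must not change the key
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to a ref, or to the order in which sources are staged,
yields a different key, so a stale staged directory is never reused.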
This blob is also going directly into the artifact; i.e. see:
Caching of build trees
https://gitlab.com/BuildStream/buildstream/issues/21
So we might avoid redundancy when an artifact cache server is also a
source cache server...
> The above would address fetching sources reliably. It would also
> address:
> #261: Investigate the use of git shallow clones (to build instead
> tarballs)
> #330: Support for mirroring of upstream sources
> To an extent, as the original format of the sources is not
> propagated beyond the host running the "mirror". However, without
> the need to set up anything in terms of serving the sources in their
> original format.
> #328: Support for downloading sources from mirrors
> It covers the case of getting source from a local ecosystem; at
> least for anything recent. Geographical awareness will need to be a
> higher level concept, that applies to endpoints like ArtifactCache,
> SourceCache, CAS, etc.
>
> The lifespan of SourceCache/CAS entries might be limited. This can
> be mitigated by keeping an archive of the original .bst files, and
> the SourceCache/CAS entries, such that there is always a way to go
> back [years] in time. Without even having to worry too much about the
> host tools (git, bzr, etc).
>
> Opening a workspace will still require the traditional source fetch.
> This should happen on-demand if there is no source present locally.
> Alternatively the user could force a fetch. This would be an action
> a user would perform e.g. when preparing to be offline.
>
> Thoughts?
I have some concerns.
It feels less robust, as we are not saving the source for posterity in
its original format. I don't think we should trust the cache for
important things we want to store; we should use something
explicitly purposed for that.
As you highlight above, even if we did go the extra mile to ensure that
the Source Cache is persistent for the sources we will need, there are
still places where we expect the original format to be available, which
mirrors should ensure (like workspaces).
Basically, I think that "trusting the intermediate source cache
designed for remote execution to consequently solve source mirroring"
is the wrong way to think about this. We are probably looking at two
potentially useful, but separate things:
  o Sharing the SourceCache
    - This is still just a "cache"
    - This contains only the unpacked/staged sources
    - Sharing this means I don't need the whole git locally,
      if a shared source cache already has exactly this
  o Source Mirroring
    - I don't worry about third party sites going missing
    - I can reproduce this forever, until I delete my mirror(s)
    - I probably want the original sources in their original formats
I don't think that Sharing a SourceCache is really going to solve the
problem of Source Mirroring, but there is no reason why Source
Mirroring could not be implemented using CAS technology as well (and
the implementation possibly simplified by this?).
Also there seems to be no reason for Sharing a SourceCache to be a bad
idea in a scenario where Source Mirroring is also available (the
SourceCache will act mostly as an optimized hot cache of things people
have been building lately).
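To make the split concrete, this is roughly the lookup order I have in
mind when both exist. It is a sketch only: obtain_source() and its
arguments are hypothetical, with dicts standing in for the shared
SourceCache and for mirrors, and a callable standing in for the
upstream fetch.

```python
def obtain_source(key, source_cache, mirrors, upstream, need_original=False):
    """Return (origin, source) for key.

    need_original is set by consumers such as workspaces, which expect
    the source in its original format and so cannot use staged trees.
    """
    # Hot cache first: staged sources people have been building lately
    if not need_original and key in source_cache:
        return "source-cache", source_cache[key]

    # Mirrors keep the original formats for as long as we keep the mirror
    for mirror in mirrors:
        if key in mirror:
            return "mirror", mirror[key]

    # Last resort: the third party site, which may have gone missing
    return "upstream", upstream(key)
```

The SourceCache can be expired freely here, since a miss only costs a
trip to the mirror.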
Cheers,
-Tristan