Re: Source distribution, WAS: Re: Proposal for Remote Execution

From: Sander Striker <s striker striker nl>
To: Tristan Van Berkom <tristan vanberkom codethink co uk>
Cc: Jürg Billeter <j bitron ch>, BuildStream <buildstream-list gnome org>
Subject: Re: Source distribution, WAS: Re: Proposal for Remote Execution
Date: Thu, 7 Jun 2018 14:36:59 +0100

On Wed, Jun 6, 2018 at 10:04 PM Tristan Van Berkom <tristan vanberkom codethink co uk> wrote:

On Wed, 2018-05-30 at 11:00 +0200, Sander Striker wrote:
> Hi,
>
> This post is not really commenting on the approach we're taking to
> implement remote execution, but rather what it enables us to do in
> terms of source distribution.

This sounds attractive, also note there has been a writeup at:

Should we use a CAS-based source cache as a mirror?
https://gitlab.com/BuildStream/buildstream/issues/418

Good question.

I'll follow up on the issue as well, thanks.

[...]
> > Staging sources into CAS is problematic if e.g. the whole .git
> > directory is
> > included. We should avoid this where possible. This is already a
> > concern
> > for caching of build trees (#21) as well and can be improved on
> > independently of any other steps.
>
> If I understand this correctly, we have an opportunity for simplified
> source "mirroring" here.
>
> If we introduce a SourceCache, which maps a sourcekey to a CAS
> directory node, the fetch operation becomes:
> 1) lookup SourceKey in SourceCache
> 1a.1) when an entry is present, fetch the Directory nodes from CAS
> and store them in local CAS.
> 1b.1) when no entry is found, fetch the source in the traditional
> sense
> 1b.2) stage the source in a temporary location (assumes #376 is
> resolved)
> 1b.3) put the staged source into the local CAS
> iff you have write permissions to SourceCache:
> 1b.4) upload the staged source to CAS
> 1b.5) put an entry into SourceCache
>
> This does optimize for remote execution, in the sense that actual
> fetching of the source as well as fetching the files that make up the
> source is avoided.
> I could envision a configuration option, allowing the user to say:
> - I am always building locally, make sure that my local CAS has
> everything I need to build locally.
> - I want the actual sources locally, make sure to always fetch in the
> traditional sense
>
> In terms of source mirroring. This could now be a central instance
> of bst that is just running bst fetch. It will have all of the
> fetched sources locally, in case they need to be inspected. All
> other instances will pull from SourceCache as CAS.
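The numbered fetch steps quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `SourceCache` and `CAS` classes are hypothetical in-memory stand-ins, `fetch_and_stage` is a hypothetical callback standing for "fetch in the traditional sense, then stage", and none of these names are BuildStream's actual API.

```python
import hashlib
from typing import Callable, Dict, Optional

Files = Dict[str, bytes]  # staged tree: relative path -> file content


class CAS:
    """Hypothetical in-memory content-addressable store for staged trees."""

    def __init__(self) -> None:
        self._dirs: Dict[str, Files] = {}

    def put_directory(self, files: Files) -> str:
        # The digest depends only on paths and file contents.
        h = hashlib.sha256()
        for path in sorted(files):
            h.update(path.encode("utf-8") + b"\0")
            h.update(hashlib.sha256(files[path]).digest())
        digest = h.hexdigest()
        self._dirs[digest] = dict(files)
        return digest

    def get_directory(self, digest: str) -> Files:
        return self._dirs[digest]

    def has(self, digest: str) -> bool:
        return digest in self._dirs


class SourceCache:
    """Hypothetical map from a source key to a CAS directory digest."""

    def __init__(self, writable: bool = True) -> None:
        self.writable = writable
        self._entries: Dict[str, str] = {}

    def lookup(self, key: str) -> Optional[str]:
        return self._entries.get(key)

    def put(self, key: str, digest: str) -> None:
        if not self.writable:
            raise PermissionError("no write permission on SourceCache")
        self._entries[key] = digest


def fetch(source_key: str, source_cache: SourceCache, remote_cas: CAS,
          local_cas: CAS, fetch_and_stage: Callable[[], Files]) -> str:
    # 1) look up the source key in the SourceCache
    digest = source_cache.lookup(source_key)
    if digest is not None:
        # 1a) entry present: copy the directory from remote CAS to local CAS
        local_cas.put_directory(remote_cas.get_directory(digest))
        return digest
    # 1b) no entry: traditional fetch, then stage, then store locally
    staged = fetch_and_stage()
    digest = local_cas.put_directory(staged)
    # Only with write permission: upload and record the entry
    if source_cache.writable:
        remote_cas.put_directory(staged)
        source_cache.put(source_key, digest)
    return digest
```

With this shape, the first client that misses performs the traditional fetch, and later clients with the same source key only talk to the SourceCache and CAS.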

Not entirely, I think you want the sources to also have been *staged*
in the way they would be used by their elements, to commit the staged
results to CAS; rather than committing an entire git repo or a still
packed up tarball (at least, this is how I'm reading the intentions).

Not quite, I was assuming the staged source, following the proposal for
remote execution, which would stage at fetch time, and put those
staged sources into CAS.

We'd probably want this mirrored blob to be addressable by the element
and its source configurations; i.e. the blob is one build directory
after all "Source" objects have unpacked what they want.

I'm not sure I follow what you are describing here.
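One possible reading of "addressable by the element and source configurations" is a key derived from the element name plus the ordered list of its source configurations. The sketch below is an assumption for illustration only: `staged_source_key` is a hypothetical helper, the configuration fields are made up, and this is not BuildStream's actual cache key calculation.

```python
import hashlib
import json
from typing import Any, Dict, List


def staged_source_key(element_name: str,
                      source_configs: List[Dict[str, Any]]) -> str:
    """Hypothetical key for the staged result of all sources of one element.

    The list order is preserved because sources are staged in order;
    dictionary keys are sorted so that equivalent configurations
    serialize identically.
    """
    payload = json.dumps(
        {"element": element_name, "sources": source_configs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to a source ref, or to the order in which sources are staged, then yields a different key, while the same configuration always maps to the same staged blob.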

This blob is also going directly into the artifact; i.e. see:

Caching of build trees
https://gitlab.com/BuildStream/buildstream/issues/21

So we might avoid redundancy when an artifact cache server is also a
source cache server...

I'm assuming you mean deduplication of source files between cached
build trees and source cache? If so, yes, that would be the case.
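To illustrate why that deduplication falls out of content addressing: blobs are stored under the hash of their content, so a file that appears both in a staged source and in a cached build tree is stored once. A minimal sketch, where `BlobStore` is a hypothetical toy store and not any real CAS implementation:

```python
import hashlib
from typing import Dict


class BlobStore:
    """Toy content-addressable store: one entry per unique file content."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same digest, so it is stored once.
        self.blobs.setdefault(digest, data)
        return digest

    def add_tree(self, files: Dict[str, bytes]) -> Dict[str, str]:
        # Returns a path -> digest listing for the tree.
        return {path: self.add(data) for path, data in files.items()}


store = BlobStore()
staged_source = {"main.c": b"int main(void){return 0;}\n", "README": b"docs\n"}
# A build tree contains the same source files plus build outputs.
build_tree = dict(staged_source)
build_tree["main.o"] = b"object code placeholder"

source_refs = store.add_tree(staged_source)
tree_refs = store.add_tree(build_tree)
```

Here the two trees reference five files in total, but the store only holds the three distinct contents; the source files shared by both trees are not stored twice.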

> The above would address fetching sources reliably. It would also
> address:
> #261: Investigate the use of git shallow clones (to build instead
> tarballs)
> #330: Support for mirroring of upstream sources
> To an extent, as the original format of the sources is not
> propagated beyond the host running the "mirror". However, without
> the need to set up anything in terms of serving the sources in their
> original format.
> #328: Support for downloading sources from mirrors
> It covers the case of getting source from a local ecosystem; at
> least for anything recent. Geographical awareness will need to be a
> higher level concept, that applies to endpoints like ArtifactCache,
> SourceCache, CAS, etc.

>
> The lifespan of SourceCache/CAS entries might be limited. This can
> be mitigated by keeping an archive of the original .bst files, and
> the SourceCache/CAS entries, such that there is always a way to go
> back [years] in time. Without even having to worry too much about the
> host tools (git, bzr, etc).
>
> Opening a workspace will still require the traditional source fetch.
> This should happen on-demand if there is no source present locally.
> Alternatively the user could force a fetch. This would be an action
> a user would perform e.g. when preparing to be offline.
>
> Thoughts?

I have some concerns.

It feels less robust as we are not saving the source for posterity in
its original format, I don't think we should trust the cache for
important things we want to store, and we should use something
explicitly purposed for that.

We could call it something different than "cache". And the design for
persistence is ours to decide. That said, I don't think we disagree.

As you highlight above, even if we did go the extra mile to ensure that
the Source Cache is persistent for the sources we will need, there are
still places where we expect the original format to be available, which
mirrors should ensure (like workspaces).

Basically, I think that "trusting the intermediate source cache
designed for remote execution to consequently solve source mirroring"
is the wrong way to think about this,

I would paraphrase it slightly differently, as "considering altering the
design of the source caching to be able to trust it for a subset of
source mirroring".

we are probably looking at two
potentially useful, but separate things:

o Sharing the SourceCache

- This is still just a "cache"

- This contains only the unpacked/staged sources
- Sharing this means I don't need the whole git locally,
if a shared source cache already has exactly this

o Source Mirroring

- I don't worry about third party sites going missing
- I can reproduce this forever, until I delete my mirror(s)
- I probably want the original sources in their original formats

I think that is a reasonable summary. I am still wondering if a generic
mirroring solution is needed as a core part of BuildStream. Solving
this for git is a known problem, and if you need to set up an endpoint
on the mirror anyway... Solving this for tarballs is also a known
problem (e.g. an immutable caching proxy). For setting up Subversion
mirrors the same applies.

Ultimately it's a trade-off of whether we want to have this in scope,
considering complexity and maintenance.

I don't think that Sharing a SourceCache is really going to solve the
problem of Source Mirroring,

Right, it has the potential to solve a subset of the problems that source
mirroring initially set out to solve. Specifically, setting up
geographically close mirrors for either latency or scalability. Basically
what you break down into Source Cache and Source Mirror above.

but there is no reason why Source
Mirroring could not be implemented using CAS technology also (and the
implementation possibly simplified by this ?).

Also there seems to be no reason for Sharing a SourceCache to be a bad
idea in a scenario where Source Mirroring is also available (the
SourceCache will act mostly as an optimized hot cache of things people
have been building lately).

Yes. Especially when it seems relatively cheap (in complexity) to get to that point.

Cheers,
-Tristan

Cheers,

Sander

