Re: [BuildStream] Partial local CAS
- From: Sander Striker <s striker striker nl>
- To: Jim MacArthur <jim macarthur codethink co uk>
- Cc: buildstream-list gnome org
- Subject: Re: [BuildStream] Partial local CAS
- Date: Fri, 23 Nov 2018 19:46:29 +0100
Thanks Jim, those two items sound like a good starting point.
Looking forward to your questions.
Cheers,
Sander
On 08/11/2018 22:01, Sander Striker via BuildStream-list wrote:
> Hi,
>
> After the exchange in the "Coping with partial artifacts" thread, I
> realize that we haven't actually had a conversation on list about
> partial local CAS, and by extension partial local ArtifactCache. Let
> me first explain what I mean by partial local CAS. Let's define it as
> a CAS that contains Tree and Directory nodes, but not [all of] the
> actual file content blobs.
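>
> To make "partial" concrete, here is a rough sketch of how one might
> enumerate the file blobs a partial CAS is missing for a given
> directory. The objects/<hash> layout and the helper names are
> illustrative only, not actual BuildStream code:
>
> import os
>
> from buildstream._protos.build.bazel.remote.execution.v2 import (
>     remote_execution_pb2,
> )
>
> def objpath(casdir, digest):
>     # Hypothetical on-disk layout: one file per blob, keyed by hash.
>     return os.path.join(casdir, 'objects',
>                         digest.hash[:2], digest.hash[2:])
>
> def missing_file_blobs(casdir, directory_digest):
>     # Directory nodes are assumed present in a partial CAS; file
>     # content blobs may not be.
>     directory = remote_execution_pb2.Directory()
>     with open(objpath(casdir, directory_digest), 'rb') as f:
>         directory.ParseFromString(f.read())
>     missing = [fn.digest for fn in directory.files
>                if not os.path.exists(objpath(casdir, fn.digest))]
>     for dirnode in directory.directories:
>         missing.extend(missing_file_blobs(casdir, dirnode.digest))
>     return missing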
>
> I'll outline the context and importance of this concept. In remote
> execution, builds do not run on the local machine. As such, to be
> able to perform a build, it is important to be able to _describe_ the
> inputs to a build. When all of the input files are locally available,
> this is straightforward. However, when the input files are not
> locally available, should we incur the cost of fetching them, or is
> there another way?
>
> To answer that question, let's review again how remote execution is
> supposed to work in the context of BuildStream. To build an element:
>
> 1) Compose a merkle tree of all dependencies, and all sources
> 2) Create a Command and an Action message
> 3) FindMissingBlobs(command, action, blobs in the merkle tree)
> 4) Upload the missing blobs
> 5) Submit the request to the execution service
> 6) Wait for the request to complete
> 7) Download the result merkle tree
> 8) Construct a merkle tree for the Artifact (based on the result)
> 9) FindMissingBlobs(blobs in the artifact merkle tree)
> 10) Upload the missing blobs
> 11) Store a ref to the artifact merkle tree in ArtifactCache
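>
> In REAPI terms, steps 2 through 6 boil down to a handful of RPCs. A
> minimal sketch against the protos (channel, instance,
> input_root_digest and input_blob_digests are assumed to exist;
> message_digest() and upload_blob() are stand-ins for hashing a
> serialised message and writing a blob to the remote CAS):
>
> from buildstream._protos.build.bazel.remote.execution.v2 import (
>     remote_execution_pb2,
>     remote_execution_pb2_grpc,
> )
>
> command = remote_execution_pb2.Command(arguments=['make'])       # step 2
> action = remote_execution_pb2.Action(
>     command_digest=message_digest(command),
>     input_root_digest=input_root_digest)
>
> cas = remote_execution_pb2_grpc.ContentAddressableStorageStub(channel)
> response = cas.FindMissingBlobs(                                 # step 3
>     remote_execution_pb2.FindMissingBlobsRequest(
>         instance_name=instance,
>         blob_digests=[message_digest(command), message_digest(action),
>                       *input_blob_digests]))
> for digest in response.missing_blob_digests:                     # step 4
>     upload_blob(digest)
>
> execution = remote_execution_pb2_grpc.ExecutionStub(channel)
> operations = execution.Execute(                                  # step 5
>     remote_execution_pb2.ExecuteRequest(
>         instance_name=instance,
>         action_digest=message_digest(action)))
> for operation in operations:                                     # step 6
>     pass  # the final Operation carries the ActionResult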
>
> Let's dive in a bit and look at where the inefficiencies are in the
> current implementation.
>
> Step 1 happens during staging, more specifically in
> buildelement.py:stage(). We start with the dependencies. For
> directories backed by CAS, we don't need to actually stage them on the
> filesystem. We can import files between CAS directories by reference
> (hash), without even needing the files locally. This isn't currently
> implemented (_casbaseddirectory.py:import_files), but that should
> change with CAS-to-CAS import (MR !911).
> After the dependencies are staged, we move on to the sources. Currently
> this is still fairly clunky, as we are actually staging sources on the
> filesystem and then importing the result into our virtual staging
> directory (element.py:_stage_sources_at). With SourceCache this should
> become as efficient as staging dependencies for non-modified elements.
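>
> To make the import-by-reference point concrete, a minimal sketch (not
> the MR !911 implementation):
>
> def import_directory_node(parent_directory, name, subdir_digest):
>     # Merging a subtree is just recording (name, digest) in the
>     # parent Directory message; the file blobs behind subdir_digest
>     # never need to exist locally.
>     node = parent_directory.directories.add()
>     node.name = name
>     node.digest.CopyFrom(subdir_digest)
>     # Re-serialising parent_directory yields a new Directory blob,
>     # and hence a new digest for the parent; no file content moves.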
>
> Steps 2 through 11 all happen during _sandboxremote.py:run().
> Steps 2-4 aren't currently implemented in this fashion; instead we
> serially issue a number of network RPCs. In _sandboxremote.py:run() a
> call is made to cascache.push_directory(), which pushes up any
> missing directory nodes and any missing files.
> In _sandboxremote.py:run_remote_command() we are using
> cascache.push_message(), followed by cascache.verify_digest_pushed().
> This results in a Write RPC followed by a FindMissingBlobs RPC, for
> both the Command and the Action. In short, we could be eliminating a
> couple of RPCs, and thus network roundtrips, here.
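>
> Continuing the sketch from above, the Command and the Action could go
> up in a single round trip via the batch endpoint (again illustrative,
> reusing the message_digest() stand-in):
>
> request = remote_execution_pb2.BatchUpdateBlobsRequest(
>     instance_name=instance)
> for message in (command, action):
>     entry = request.requests.add()
>     entry.digest.CopyFrom(message_digest(message))
>     entry.data = message.SerializeToString()
> response = cas.BatchUpdateBlobs(request)
> for entry in response.responses:
>     if entry.status.code != 0:  # google.rpc code OK
>         raise RuntimeError('failed to upload ' + entry.digest.hash)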
> I'll skip over steps 5-6 as these are not very interesting, although
> it should be noted that _sandboxremote.py:run() is ignoring the build
> logs from the execution response.
> In step 7, which happens in _sandboxremote.py:process_job_output(), we
> take a Tree digest that we received from the execution service, and
> use it in a call to cascache.pull_tree(). This fetches all of the
> file blobs present in the tree that are not yet available locally. It
> also stores all of the directory nodes referenced in the tree, and
> returns the root digest, which is used to construct the result
> virtual directory of the sandbox.
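>
> Roughly, what a pull_tree() has to do (fetch_blob(), store_blob() and
> local_blob_exists() are stand-ins for the remote fetch and the local
> CAS store, not the actual cascache API):
>
> def pull_tree(tree_digest):
>     tree = remote_execution_pb2.Tree()
>     tree.ParseFromString(fetch_blob(tree_digest))
>     for directory in [tree.root, *tree.children]:
>         for filenode in directory.files:
>             if not local_blob_exists(filenode.digest):
>                 store_blob(filenode.digest, fetch_blob(filenode.digest))
>         # Store the directory node itself, so the tree can be walked
>         # locally even before (or without) the file blobs arriving.
>         store_blob(message_digest(directory),
>                    directory.SerializeToString())
>     return message_digest(tree.root)  # root of the virtual directory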
> In step 8 we go back to constructing a filesystem representation of
> the artifact, instead of using a CAS-backed directory. This happens
> in element.py:assemble() through a call to cascache.commit(), which
> does a local filesystem import of files, the majority of which we
> just exported in step 7, and puts an entry in the local ArtifactCache.
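>
> For contrast, an artifact tree could be assembled from digests alone,
> without re-importing anything from the filesystem. The files/meta/logs
> layout and the digest names here are illustrative:
>
> artifact = remote_execution_pb2.Directory()
> for name, digest in (('files', result_root_digest),
>                      ('meta', meta_directory_digest),
>                      ('logs', logs_directory_digest)):
>     node = artifact.directories.add()
>     node.name = name
>     node.digest.CopyFrom(digest)
> artifact_digest = message_digest(artifact)  # then store and push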
> Steps 9-11 happen during the push phase. Here we rely on
> cascache.push() to ensure that the artifact is made available on the
> remote CAS server.
>
> Side note while we're here: apart from steps 9-11, we don't actually
> make it clear to the scheduler which resources are needed. As far as
> it is concerned, a remote build job currently takes up PROCESS tokens.
>
> If you made it all the way here, thank you :). I think we need to
> eliminate the unneeded filesystem access first.
>
I think there are two things we can do right now which won't be
controversial:
1) Remove the calls to verify_digest_pushed - on inspection of the code,
I'm now convinced these cannot do anything useful and were just added to
be defensive.
2) Alter Element._cache_artifact so it uses the virtual directory
system instead of a plain filesystem.
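To illustrate the direction of item 2, a very rough sketch, not a
patch; the CasBasedDirectory accessor and the commit call here are
hypothetical:

vdir = sandbox.get_virtual_directory()
if isinstance(vdir, CasBasedDirectory):
    # Commit by digest; nothing needs to touch a scratch directory.
    artifact_digest = vdir._get_digest()   # hypothetical accessor
else:
    # Fall back to the existing filesystem import for plain directories.
    artifact_digest = cascache.commit_directory(vdir)  # hypothetical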
Now that I've caught up on the partial CAS discussion, I have quite a
few questions about it, but I'll put them in a separate email.