Re: Discussion on source mirroring (with counter proposal)
- From: Paul Sherwood <paul sherwood codethink co uk>
- To: Tristan Van Berkom <tristan vanberkom codethink co uk>
- Cc: Jonathan Maw <jonathan maw codethink co uk>, buildstream-list gnome org
- Subject: Re: Discussion on source mirroring (with counter proposal)
- Date: Mon, 19 Mar 2018 08:44:03 +0000
Hi Tristan,
I can't see any justification for addressing the specific A.) case
either.
Note that the Baserock project successfully addressed the B.) case quite
a long time ago with git.baserock.org (and baserock trove in general),
while making the assumption that standardising everything into Git was a
worthwhile simplification.
There are further nuances:
- what happens if an upstream source is tampered with, and the faked
history or malicious code ends up being mirrored into B.)? This has
happened several times recently in some upstreams.
- what happens if B.) itself is tampered with?
I think the implication is that we need integrity checks; for the
ultra-paranoid, maybe multiple mirrors checking each other's homework.
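As a rough illustration of what such a check could look like, here is a
minimal Python sketch. The function names and the idea of a "pinned"
digest recorded alongside the project are assumptions made for the
example; none of this is existing BuildStream or Baserock API.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Optional


def sha256_of(blob: bytes) -> str:
    """Return the hex SHA-256 digest of a downloaded blob."""
    return hashlib.sha256(blob).hexdigest()


def suspect_mirrors(copies: Dict[str, bytes],
                    pinned: Optional[str] = None) -> List[str]:
    """Return the names of mirrors whose copy does not match.

    copies maps a (hypothetical) mirror name to the bytes it served.
    If a pinned digest is given, every copy must match it. Otherwise
    the most common digest among the mirrors is taken as the reference.
    """
    if not copies:
        return []
    digests = {name: sha256_of(blob) for name, blob in copies.items()}
    if pinned is None:
        # No trusted digest available: fall back to a majority vote.
        pinned = Counter(digests.values()).most_common(1)[0][0]
    return sorted(name for name, digest in digests.items()
                  if digest != pinned)
```

A majority vote only catches a mirror that diverges from its peers. If
the tampering happened upstream and was mirrored everywhere, only a
digest pinned before the tampering would reveal it.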
br
Paul
On 2018-03-19 05:47, Tristan Van Berkom wrote:
Hi again.
After rereading this thread, I realize that I have left this in an
ambiguous state, and have assumed that people can read my mind. I want
to rectify this and get the ball rolling properly.
This is going to be a fairly long-ish email, but this is necessary,
given that I've been too ambiguous and unclear.
On Fri, 2018-03-16 at 15:30 +0900, Tristan Van Berkom wrote:
Thanks for the writeup !
On Thu, 2018-03-15 at 16:30 +0000, Jonathan Maw wrote:
> I've been giving some thought on source mirroring, recently, after
> reading the discussion at
> https://gitlab.com/BuildStream/buildstream/issues/179.
>
> Source mirroring will be valuable for us because:
> * The canonical upstream may disappear without warning
> * The canonical upstream may be slow to access due to limited
> infrastructure or geographical distance.
* The organization may be mirroring everything in a local build farm
- To be sure that their builds are repeatable in 10 years
- To optimize fetch times on build machines
- Without losing the information of what the original URL was
So what are the use cases here ?
This is the most important question to have answered before focusing on
a single solution.
Jonathan has stated that upstream sources can disappear, and
downloading from some locations can be slow; my response doesn't really
add much - so let's build on these two.
What are we trying to achieve ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before talking about how to achieve this, I want to pause and think
about what exactly we want to achieve - I feel that we are not all on
the same page about this.
I can see two separate ideas of what "mirror" means here:
A.) Because a specific third party server proves to be unreliable,
we want to be able to have a fallback for that server, which we
can roll over to in the case that the upstream doesn't work.
So this is a quick fix / bandaid for a specific pain point that a
given organization experiences while using BuildStream; this
allows us to have a tarball server for unreliable tarballs, or
roll over to a GitHub mirror for an unreliable upstream GitHub.
B.) For the same reasons, an organization may just never want to
experience unreliable access to source code ever again.
The cost of hosting all sources which their BuildStream projects
require on a single server; or even mirrored in some
strategically placed locations (so that one can choose a mirror
that is geographically closer when building), is a relatively low
cost.
Instead of many points of failure on various servers scattered
across the globe, a single point of failure *that is under the
control of the organization in question*, is much more desirable.
While a solution along the lines of (A) can improve things in the short
term, I feel this is just a bandaid and overall, we're living with the
same problem - e.g. a known fallback mirror for an upstream may one day
also prove to be unreliable. Instead, an organization which wants
reliability will eventually move towards (B), and decide to host a
centralized mirror themselves.
I have never really given much thought to the (A) use case, and I
clearly prefer a solution along the lines of (B).
While (A) and (B) are not entirely mutually exclusive (i.e., one could
achieve something like (B) using a solution designed for (A)), I worry
that (A) adds unnecessary complexity, when the goal should ultimately
be (B).
Unnecessary Complexity
~~~~~~~~~~~~~~~~~~~~~~
The unnecessary complexity I'm talking about is specifically:
* We need to try multiple servers in a single session, in some way
or another, this could be:
- Teaching Sources to do it themselves, as Jonathan proposes
- Having the core reconstruct and re-instantiate Source objects
for each alias that they use, when one fails
- Having the core contact multiple servers at startup time in
order to choose which mirror is preferable
Frankly, any of the above is undesirably complex.
* The configuration API is complex and burdensome to the user. If we
essentially want to achieve (B) *anyway*, why do I have to list
fallback mirrors for each and every source alias separately ?
Counter Proposal
~~~~~~~~~~~~~~~~
I have not been clear on the list about what my vision for this is, so
let me lay out this counter proposal, which I think is both easier to
implement and a more robust solution along the lines of (B) as
described above.
New Source.mirror() API
~~~~~~~~~~~~~~~~~~~~~~~
For most Source implementations, this is exactly the same as what
they are doing already in Source.track() or Source.fetch(), but with
some different guarantees:
- Guarantee that *everything* is mirrored for the given source,
regardless of tracking branch or ref.
This means shortcuts like shallow clones and such are just
not allowed, and every time Source.mirror() is called, it should
attempt to get the latest of everything.
- The local source cache is built in such a way that it is
reliable for downloading from another location.
This means that we need an alternative code path for tarball and
zip (internally `_downloadablefilesource.py`), such that the
original filename is retained, and the file is not locally
renamed to be a sha256sum filename instead.
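To make the difference between the two cache layouts concrete, here is
a small Python sketch. Both function names and the host/path directory
layout are assumptions for illustration only; the proposal itself only
requires that the original filename is retained.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def fetch_cache_path(cache_dir: str, payload: bytes) -> str:
    """Fetch-style layout: the file is named after its sha256 digest,
    so the original filename is lost."""
    return posixpath.join(cache_dir, hashlib.sha256(payload).hexdigest())


def mirror_cache_path(cache_dir: str, url: str) -> str:
    """Mirror-style layout (hypothetical): keep the upstream host and
    path, so the original filename survives and the directory can be
    served as-is by a plain file server."""
    parsed = urlparse(url)
    return posixpath.join(cache_dir, parsed.netloc,
                          parsed.path.lstrip("/"))
```

With a layout like the second one, a client that knows the upstream URL
can derive where the same file lives on the mirror without any extra
lookup table.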
New `bst mirror` command
~~~~~~~~~~~~~~~~~~~~~~~~
This works much like `bst fetch` or `bst track`, but calls the new
Source.mirror() method instead.
One exception is that in this mode there should not be a TARGET
argument, instead all bst files in the project should be loaded.
Single mirroring configuration API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In project.conf we just provide a URL to the mirror, which will be
used *instead* of the upstreams listed in project data.
If we support multiple mirror URLs in project.conf, then a session
can scan them once and choose the best mirror.
If we support user configuration overrides, then we expect the
project maintainers to communicate the available mirrors to their
developers or whomever builds that project, such that the user can
just choose the mirror closest to them.
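The selection logic described here could be sketched in Python roughly
as follows. Everything in it is an assumption for the sake of the
example: the function names, the use of TCP connect time as the measure
of "closest", and the rule that a user override always wins.

```python
import socket
import time
from typing import Callable, Iterable, Optional
from urllib.parse import urlparse


def tcp_connect_time(url: str, timeout: float = 3.0) -> float:
    """Measure how long a TCP connection to the mirror's host takes."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    start = time.monotonic()
    with socket.create_connection((parsed.hostname, port), timeout=timeout):
        return time.monotonic() - start


def choose_mirror(mirrors: Iterable[str],
                  probe: Callable[[str], float] = tcp_connect_time,
                  user_choice: Optional[str] = None) -> Optional[str]:
    """Pick one mirror for the whole session.

    A user override wins unconditionally. Otherwise each mirror is
    probed exactly once and the fastest reachable one is returned, or
    None if no mirror answered.
    """
    if user_choice is not None:
        return user_choice
    best_url, best_time = None, float("inf")
    for url in mirrors:
        try:
            elapsed = probe(url)
        except OSError:
            continue  # unreachable mirror: skip it
        if elapsed < best_time:
            best_url, best_time = url, elapsed
    return best_url
```

The scan happens once at startup and yields a single URL, so no Source
ever has to retry against a second server in the middle of a session.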
Alternative implementation of Source.translate_url()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here the core currently simply expands an alias.
In the case that we are building/fetching things, and there is a
configured mirror, we have Source.translate_url() point to the
mirror instead.
This part might be a *little* tricky, but is certainly
straightforward, given that:
- BuildStream has knowledge of its own source cache layout,
i.e.: ${XDG_CACHE_HOME}/buildstream/sources/${source_kind}
- Sources themselves have knowledge of how things are cached
inside their dedicated cache directories.
Resolving the correct URL here is easy.
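A minimal Python sketch of this idea follows. It is written as a
standalone function rather than the real Source.translate_url()
method, and the mirror layout it assumes (one subdirectory per source
kind, then upstream host and path) is hypothetical; each real Source
kind would substitute its own knowledge of its cache directory.

```python
import posixpath
from typing import Dict, Optional
from urllib.parse import urlparse


def translate_url(url: str,
                  aliases: Dict[str, str],
                  source_kind: str,
                  mirror: Optional[str] = None) -> str:
    """Expand an alias, then optionally redirect to the mirror."""
    # Step 1: plain alias expansion, e.g. "gnome:glib.git".
    alias, sep, rest = url.partition(":")
    if sep and alias in aliases:
        upstream = aliases[alias] + rest
    else:
        upstream = url

    # Step 2: without a configured mirror, use the upstream as-is.
    if mirror is None:
        return upstream

    # Step 3: with a mirror, point at the matching location in the
    # mirror's source cache (assumed layout, see the note above).
    parsed = urlparse(upstream)
    return posixpath.join(mirror, source_kind, parsed.netloc,
                          parsed.path.lstrip("/"))
```

For example, with aliases = {"gnome": "https://gitlab.gnome.org/GNOME/"},
the URL "gnome:glib.git" resolves to the upstream when no mirror is set,
and to a path under the mirror's "git" subdirectory when one is.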
Setting up a mirror server
~~~~~~~~~~~~~~~~~~~~~~~~~~
To set up a mirror server, one needs some knowledge of what is being
hosted. The process for setting up a mirror runs mostly like this:
o Configure BuildStream to have its source cache in a location
on the server for hosting.
o Configure access to the ${source_kind} specific subdirectories
for the URI schemes which need to be supported.
I.e. for tarball and zip, just an HTTP(S) server is enough.
For git, you might support only HTTP(S) access, but you
may also want to support "git://..." URI schemes.
o Configure your mirror server to do the following:
- Periodically call `bst mirror` for the latest version of your
project.
- For more robust mirroring, you may want to go so far as to
have a mirror session "triggered" by a commit to the git repo
which is hosting your BuildStream project. This is just to
ensure that you *never* miss a beat.
Properties of the counter proposal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My counter proposal, while requiring some code to be written, is less
complex: it does not imply the "Unnecessary Complexity" drawbacks
which I have highlighted above.
The code involved in this solution perhaps involves some boilerplate,
but is actually *easier to write* and very straightforward.
By forcing a mirror to be a single location, there are fewer points of
failure, and the one point of failure is under the control of the
people who maintain the BuildStream project in question.
What this proposal does *not* do, however, is add any possibility for
the bandaids described as (A) above. It does provide a practical
solution for (B); users who want (A) should be satisfied by (B) as
well, but the opposite is not entirely true.
I would very much like to hear feedback on this, particularly I would
like to know if I've missed something about the (A) approach which is
absolutely needed even in the presence of a (B) solution, and/or if it
is more desirable/necessary to have sessions try multiple URLs for the
same source in the same session - or, anything else I may have missed.
Regards,
-Tristan