Re: Discussion on source mirroring (with counter proposal)

From: Sander Striker <s striker striker nl>
To: Tristan Van Berkom <tristan vanberkom codethink co uk>
Cc: Jonathan Maw <jonathan maw codethink co uk>, buildstream-list gnome org
Subject: Re: Discussion on source mirroring (with counter proposal)
Date: Mon, 19 Mar 2018 14:03:15 +0000

Hi Tristan,

On Mon, Mar 19, 2018 at 6:47 AM Tristan Van Berkom <tristan vanberkom codethink co uk> wrote:

[...]

What are we trying to achieve ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before talking about how to achieve this, I want to pause and think
about what exactly we want to achieve - I feel that we are not all on
the same page about this.

I can see two separate ideas of what "mirror" means here:

A.) Because a specific third party server proves to be unreliable,
we want to be able to have a fallback for that server, which we
can roll over to in the case that the upstream doesn't work.

So this is a quick fix / bandaid for a specific pain point that a
given organization experiences while using BuildStream; this
allows us to have a tarball server for unreliable tarballs, or
roll over to a GitHub mirror for an unreliable upstream GitHub.

B.) For the same reasons, an organization may just never want to
experience unreliable access to source code ever again.

The cost of hosting all sources which their BuildStream projects
require on a single server, or even mirrored in some
strategically placed locations (so that one can choose a mirror
that is geographically closer when building), is relatively
low.

Instead of many points of failure on various servers scattered
across the globe, a single point of failure *that is under the
control of the organization in question*, is much more desirable.

While a solution along the lines of (A) can improve things in the short
term, I feel this is just a bandaid and overall, we're living with the
same problem - e.g. a known fallback mirror for an upstream may one day
also prove to be unreliable. Instead, an organization which wants
reliability will eventually move towards (B), and decide to host a
centralized mirror themselves.

I have never really given much thought to the (A) use case, and I
clearly prefer a solution along the lines of (B).

While (A) and (B) are not entirely mutually exclusive (i.e., one could
achieve something like (B) using a solution designed for (A)), I worry
that (A) adds unnecessary complexity, when the goal should ultimately
be (B).

Hold that thought on being able to achieve B when providing A.

Unnecessary Complexity
~~~~~~~~~~~~~~~~~~~~~~
The unnecessary complexity I'm talking about is specifically:

* We need to try multiple servers in a single session, in some way
or another, this could be:

- Teaching Sources to do it themselves, as Jonathan proposes

- Having the core reconstruct and re-instantiate Source objects
for each alias that they use, when one fails

- Having the core contact multiple servers at startup time in
order to choose which mirror is preferable

Frankly, any of the above is quite undesirably complex.

* Configuration API is complex and burdensome to the user. If we
essentially want to achieve (B) *anyway*, why do I have to list
fallback mirrors for each and every source alias separately ?

Counter Proposal
~~~~~~~~~~~~~~~~
I have not been clear on the list about what my vision for this is, so
let me lay out this counter proposal, which I think is both easier to
implement and a more robust solution along the lines of (B) as
expressed above.

New Source.mirror() API
~~~~~~~~~~~~~~~~~~~~~~~
For most Source implementations, this is exactly the same as what
they are doing already in Source.track() or Source.fetch(), but with
some different guarantees:

- Guarantee that *everything* is mirrored for the given source,
regardless of tracking branch or ref.

This means shortcuts like shallow clones and such are just
not allowed, and every time Source.mirror() is called, it should
attempt to get the latest of everything.

- The local source cache is built in such a way that it is
reliable for downloading from another location.

This means that we need an alternative code path for tarball and
zip (internally `_downloadablefilesource.py`), such that the
original filename is retained, and the file is not locally
renamed to be a sha256sum filename instead.
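
To illustrate the filename point above, here is a minimal sketch of the two cache layouts. It is not BuildStream's actual code: the function names `fetch_cache_path` and `mirror_cache_path`, the cache directory, and the example URL are all hypothetical, and it only assumes what is stated above (fetch renames to a sha256sum filename, mirror retains the original name).

```python
import hashlib
import os
import posixpath
from urllib.parse import urlparse


def sha256_of(path):
    """Return the hex sha256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_cache_path(cache_dir, downloaded_file):
    """Fetch-style layout (hypothetical): the file is stored under its digest."""
    return os.path.join(cache_dir, sha256_of(downloaded_file))


def mirror_cache_path(cache_dir, url):
    """Mirror-style layout (hypothetical): the upstream filename is retained,
    so another client can download it from the mirror by its original name."""
    filename = posixpath.basename(urlparse(url).path)
    return os.path.join(cache_dir, filename)
```

With the mirror-style layout, a URL ending in `foo-1.0.tar.gz` maps to a cache entry that is still called `foo-1.0.tar.gz`, which is what makes the directory usable as a download location.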

Is this a required method for every Source to implement? If not, what happens if the Source does not implement it?

New `bst mirror` command
~~~~~~~~~~~~~~~~~~~~~~~~
This works much like `bst fetch` or `bst track`, but calls the new
Source.mirror() method instead.

One exception is that in this mode there should not be a TARGET
argument, instead all bst files in the project should be loaded.
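
As a rough sketch of the "no TARGET, load everything" behaviour described above: the helper names `find_element_files` and `mirror_project`, and the `load_sources` callback, are hypothetical and stand in for whatever the core loader really does. Only the `mirror()` method name comes from the proposal.

```python
import os


def find_element_files(element_dir):
    """Return every .bst file under element_dir, as sorted relative paths."""
    found = []
    for root, _dirs, files in os.walk(element_dir):
        for name in files:
            if name.endswith(".bst"):
                full = os.path.join(root, name)
                found.append(os.path.relpath(full, element_dir))
    return sorted(found)


def mirror_project(element_dir, load_sources):
    """Call mirror() on every source of every element in the project.

    load_sources is a hypothetical callback mapping an element path to
    a list of objects exposing the proposed mirror() method.
    Returns the number of mirror() calls made.
    """
    calls = 0
    for element in find_element_files(element_dir):
        for source in load_sources(element):
            source.mirror()
            calls += 1
    return calls
```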

Single mirroring configuration API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In project.conf we just provide a URL to the mirror, which will be
used *instead* of the upstreams listed in project data.

If we support multiple mirror URLs in project.conf, then a session
can scan them one time and choose the most optimal mirror.

If we support user configuration overrides, then we expect the
project maintainers to communicate the available mirrors to their
developers or whomever builds that project, such that the user can
just choose the mirror closest to them.
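
One possible reading of the selection logic above, as a sketch only: a user override wins outright, otherwise the session probes each project-listed mirror once and keeps the fastest reachable one. The function `choose_mirror`, the `probe` callback and the example URLs are assumptions made for illustration, and "most optimal" is interpreted here as lowest measured latency.

```python
def choose_mirror(project_mirrors, probe, user_override=None):
    """Pick a single mirror URL for the whole session.

    project_mirrors: mirror URLs listed in project.conf (hypothetical field).
    probe: callable returning a latency in seconds, or None if unreachable.
    user_override: mirror chosen in user configuration, if any.
    """
    if user_override is not None:
        # The user knows which mirror is closest to them; do not probe.
        return user_override

    best_url, best_latency = None, None
    for url in project_mirrors:
        latency = probe(url)  # each mirror is contacted exactly once
        if latency is None:
            continue
        if best_latency is None or latency < best_latency:
            best_url, best_latency = url, latency
    return best_url  # None means no mirror was reachable
```

Because the choice is made once at startup, the rest of the session still talks to a single server, which keeps this distinct from per-source fallback.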

How do you treat partial mirrors? If I am dealing with multiple projects whose sources are potentially subsets of one another, what happens?

Alternative implementation of Source.translate_url()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here the core currently simply expands an alias.

In the case that we are building/fetching things, and there is a
configured mirror, we have Source.translate_url() point to the
mirror instead.

This part might be a *little* tricky, but is certainly
straightforward, given that:

- BuildStream has knowledge of its own source cache layout,
i.e.: ${XDG_CACHE_HOME}/buildstream/sources/${source_kind}

- Sources themselves have knowledge of how things are cached
inside their dedicated cache directories.

Resolving the correct URL here is easy.
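
A minimal sketch of how that resolution could look, under stated assumptions: this free function only mimics the idea of Source.translate_url() and is not its real signature; the `mirror_base` and `cache_relpath` parameters, the `tar_relpath` helper and the example URLs are hypothetical. The per-kind subdirectory follows the sources/${source_kind} layout mentioned above.

```python
import posixpath
from urllib.parse import urlparse


def tar_relpath(url):
    """Hypothetical per-source rule: tarballs are cached by original filename."""
    return posixpath.basename(urlparse(url).path)


def translate_url(url, aliases, source_kind,
                  mirror_base=None, cache_relpath=tar_relpath):
    """Expand an alias, or point at the mirror when one is configured."""
    alias, sep, remainder = url.partition(":")
    if sep and alias in aliases:
        expanded = aliases[alias] + remainder
    else:
        expanded = url  # no known alias: use the URL as written

    if mirror_base is None:
        return expanded  # current behaviour: plain alias expansion

    # Mirror layout assumed to match the cache: <mirror>/<source_kind>/<relpath>
    return "/".join([mirror_base.rstrip("/"), source_kind,
                     cache_relpath(expanded)])
```

For example, with a hypothetical alias `gnome` and a mirror at `https://mirror.example.com/sources/`, the URL `gnome:glib/glib-2.56.0.tar.xz` would resolve to `https://mirror.example.com/sources/tar/glib-2.56.0.tar.xz`.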

Setting up a mirror server
~~~~~~~~~~~~~~~~~~~~~~~~~~
To set up a mirror server, one needs to have some knowledge of
what things they are hosting. The process for setting up a mirror
runs mostly like this:

o Configure BuildStream to have its source cache in a location
on the server for hosting.

o Configure access to the ${source_kind} specific subdirectories
for the URI schemes which need to be supported.

I.e. for tarball and zip, just an HTTP(S) server is enough.

For git, you might support only HTTP(S) access, but you
may also want to have support for "git://..." URI schemes.

o Configure your mirror server to periodically do the following:

- Call `bst mirror` for the latest version of your
project.

- For more robust mirroring, you may want to go so far as to
have a mirror session "triggered" by a commit to the git repo
which is hosting your BuildStream project. This is just to
ensure that you *never* miss a beat.

I still feel we are overstepping scope. The issue of source persistence is not unique to BuildStream. Organizations may already have solutions for this in place, which they would like to continue to leverage. What type of solution is in place depends on the Source types; git and subversion are different beasts than, say, a package repository. They may exhibit different scalability characteristics as well.

Requiring that mirrors be BuildStream-created/managed dismisses those solutions, or at least complicates their use.

Properties of the counter proposal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My counter proposal, while being a bit of code that needs writing, is
less complex; it does not imply the "Unnecessary Complexity" drawbacks
which I have highlighted above.

The code involved in this solution perhaps involves some boilerplate,
but is actually *easier to write* and very straightforward.

By forcing a mirror to be a single location, there are fewer points of
failure, and the one point of failure is under the control of the
people who maintain the said BuildStream project.

What this proposal does *not* do, however, is add any possibility for
the bandaids described as (A) above. It does provide a
practical solution for (B) - users who want (A) should be satisfied by
(B) as well - but the opposite is not entirely true.

I would very much like to hear feedback on this, particularly I would
like to know if I've missed something about the (A) approach which is
absolutely needed even in the presence of a (B) solution, and/or if it
is more desirable/necessary to have sessions try multiple URLs for the
same source in the same session - or, anything else I may have missed.

I think the project focus of the mirror is going to make the setup more complicated for multiple projects, as you now need to start creating a composed single project to ensure your mirroring needs are covered.

I further think that not being able to reuse existing mirror[ing solutions] is a negative.

Cheers,

Sander

Regards,
-Tristan

