Re: Discussion on source mirroring (with counter proposal)

From: Tristan Van Berkom <tristan vanberkom codethink co uk>
To: Jonathan Maw <jonathan maw codethink co uk>, buildstream-list gnome org
Subject: Re: Discussion on source mirroring (with counter proposal)
Date: Mon, 19 Mar 2018 14:47:30 +0900

Hi again.

After rereading this thread, I realize that I have left this in an
ambiguous state, and have assumed that people can read my
mind. I want to rectify this and get the ball rolling properly.

This is going to be a fairly long-ish email, but this is necessary,
given that I've been too ambiguous and unclear.

On Fri, 2018-03-16 at 15:30 +0900, Tristan Van Berkom wrote:

Thanks for the writeup !

On Thu, 2018-03-15 at 16:30 +0000, Jonathan Maw wrote:

I've been giving some thought on source mirroring, recently, after 
reading the discussion at 
https://gitlab.com/BuildStream/buildstream/issues/179.

Source mirroring will be valuable for us because:
* The canonical upstream may disappear without warning
* The canonical upstream may be slow to access due to limited 
infrastructure or geographical distance.


* The organization may be mirroring everything in a local build farm
  - To be sure that their builds are repeatable in 10 years
  - To optimize fetch times on build machines
  - Without losing the information of what the original URL was


So what are the use cases here ?

This is the most important question to have answered before focusing on
a single solution.

Jonathan has stated that upstream sources can disappear, and
downloading from some locations can be slow; my response doesn't really
add much, so let's build on these two.


What are we trying to achieve ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before talking about how to achieve this, I want to pause and think
about what exactly we want to achieve - I feel that we are not all on
the same page about this.

I can see two separate ideas of what "mirror" means here:

  A.) Because a specific third party server proves to be unreliable,
      we want to be able to have a fallback for that server, which we
      can roll over to in the case that the upstream doesn't work.

      So this is a quick fix / bandaid for a specific pain point that a
      given organization experiences while using BuildStream; this
      allows us to have a tarball server for unreliable tarballs, or
      roll over to a GitHub mirror for an unreliable upstream GitHub.

  B.) For the same reasons, an organization may just never want to 
      experience unreliable access to source code ever again.

      The cost of hosting all sources which their BuildStream projects
      require on a single server, or even mirrored in some
      strategically placed locations (so that one can choose a mirror
      that is geographically closer when building), is relatively low.

      Instead of many points of failure on various servers scattered
      across the globe, a single point of failure *that is under the
      control of the organization in question*, is much more desirable.


While a solution along the lines of (A) can improve things in the short
term, I feel this is just a bandaid and overall, we're living with the
same problem - e.g. a known fallback mirror for an upstream may one day
also prove to be unreliable. Instead, an organization which wants
reliability will eventually move towards (B), and decide to host a
centralized mirror themselves.


I have never really given much thought to the (A) use case, and I
clearly prefer a solution along the lines of (B).

While (A) and (B) are not entirely mutually exclusive (i.e., one could
achieve something like (B) using a solution designed for (A)), I worry
that (A) adds unnecessary complexity, when the goal should ultimately
be (B).


Unnecessary Complexity
~~~~~~~~~~~~~~~~~~~~~~
The unnecessary complexity I'm talking about is specifically:

  * We need to try multiple servers in a single session, in some way
    or another. This could be:

    - Teaching Sources to do it themselves, as Jonathan proposes

    - Having the core reconstruct and re-instantiate Source objects
      for each alias that they use, when one fails

    - Having the core contact multiple servers at startup time in
      order to choose which mirror is preferable

    Frankly, any of the above is quite undesirably complex.

  * The configuration API is complex and burdensome to the user. If we
    essentially want to achieve (B) *anyway*, why do I have to list
    fallback mirrors for each and every source alias separately ?



Counter Proposal
~~~~~~~~~~~~~~~~
I have not been clear on the list about what my vision for this is, so
let me lay out this counter proposal, which I think is both easier to
implement and a more robust solution along the lines of (B) as
expressed above.


   New Source.mirror() API
   ~~~~~~~~~~~~~~~~~~~~~~~
   For most Source implementations, this is exactly the same as what
   they are doing already in Source.track() or Source.fetch(), but with
   some different guarantees:

     - Guarantee that *everything* is mirrored for the given source,
       regardless of tracking branch or ref.

       This means shortcuts like shallow clones and such are just
       not allowed, and every time Source.mirror() is called, it should
       attempt to get the latest of everything.

     - The local source cache is built in such a way that it is
       reliable for downloading from another location.

       This means that we need an alternative code path for tarball and
       zip (internally `_downloadablefilesource.py`), such that the
       original filename is retained, and the file is not locally
       renamed to be a sha256sum filename instead.
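
   As a rough illustration of that second guarantee (this is not
   BuildStream's actual code), the difference between the two naming
   schemes could look like the sketch below. The function names
   `fetch_cache_path()` and `mirror_cache_path()` are hypothetical, and
   the sketch does not address two upstreams publishing files with the
   same name.

```python
import hashlib
import os
from urllib.parse import urlparse


def fetch_cache_path(cache_dir, data):
    """Hypothetical: name the cached file after the sha256 of its content."""
    return os.path.join(cache_dir, hashlib.sha256(data).hexdigest())


def mirror_cache_path(cache_dir, url):
    """Hypothetical: keep the upstream filename so the cache can be served."""
    filename = os.path.basename(urlparse(url).path)
    if not filename:
        raise ValueError("URL has no filename component: " + url)
    return os.path.join(cache_dir, filename)
```

   With the second form, a plain HTTP(S) server pointed at the cache
   directory serves files under the names that upstream URLs end with.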


   New `bst mirror` command
   ~~~~~~~~~~~~~~~~~~~~~~~~
   This works much like `bst fetch` or `bst track`, but calls the new
   Source.mirror() method instead.

   One exception is that in this mode there should not be a TARGET
   argument, instead all bst files in the project should be loaded.


   Single mirroring configuration API
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   In project.conf we just provide a URL to the mirror, which will be
   used *instead* of the upstreams listed in project data.

   If we support multiple mirror URLs in project.conf, then a session
   can scan them once and choose the best mirror.

   If we support user configuration overrides, then we expect the
   project maintainers to communicate the available mirrors to their
   developers or whoever builds that project, so that the user can
   just choose the mirror closest to them.
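
   To make the "scan them one time" idea concrete, here is a minimal
   sketch of picking a mirror at session startup. Everything in it is
   an assumption for illustration: `choose_mirror()` is a hypothetical
   helper, and `probe` stands for whatever measurement the core would
   use (for example, the time taken by a small request), returning
   None when a mirror does not respond.

```python
def choose_mirror(mirrors, probe):
    """Return the mirror with the lowest probe time, or None if none respond.

    `mirrors` is an iterable of mirror URLs; `probe` is a hypothetical
    callable mapping a URL to a response time in seconds, or None.
    """
    best_url = None
    best_time = None
    for url in mirrors:
        elapsed = probe(url)
        if elapsed is None:
            continue  # unreachable mirror, skip it
        if best_time is None or elapsed < best_time:
            best_url, best_time = url, elapsed
    return best_url
```

   The probe runs once per mirror per session, so sources never need to
   try several servers themselves.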


   Alternative implementation of Source.translate_url()
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   Here the core currently simply expands an alias.

   In the case that we are building/fetching things, and there is a
   configured mirror, we have Source.translate_url() point to the
   mirror instead.

   This part might be a *little* tricky, but is certainly
   straightforward, given that:

     - BuildStream has knowledge of its own source cache layout,
       i.e.: ${XDG_CACHE_HOME}/buildstream/sources/${source_kind}

     - Sources themselves have knowledge of how things are cached
       inside their dedicated cache directories.

   Resolving the correct URL here is easy.
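
   A minimal sketch of how this could work, under stated assumptions:
   this is a standalone function rather than the real
   Source.translate_url() method, the mirror is assumed to expose the
   source cache as <mirror>/<source_kind>/<cached name>, and
   `cache_name` is a hypothetical callable standing in for each
   Source's knowledge of how it names things in its cache directory.

```python
def translate_url(url, aliases, source_kind, cache_name, mirror=None):
    """Expand an alias, or point at the mirror when one is configured."""
    alias, sep, body = url.partition(":")
    upstream = aliases.get(alias) if sep else None
    if upstream is None:
        return url  # not an aliased URL, leave it untouched
    expanded = upstream + body
    if mirror is None:
        return expanded  # no mirror configured: plain alias expansion
    # Assumed mirror layout: <mirror>/<source_kind>/<name in the cache>
    return "{}/{}/{}".format(mirror.rstrip("/"), source_kind, cache_name(expanded))
```

   For a tarball source whose cache keeps the original filename,
   `cache_name` could simply return the last path component of the
   upstream URL.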


   Setting up a mirror server
   ~~~~~~~~~~~~~~~~~~~~~~~~~~
   To set up a mirror server, one needs to have some knowledge of
   what things they are hosting. The process for setting up a mirror
   runs mostly like this:

     o Configure BuildStream to have its source cache in a location
       on the server for hosting.

     o Configure access to the ${source_kind} specific subdirectories
       for the URI schemes which need to be supported.

       I.e. for tarball and zip, just an HTTP(S) server is enough.

       For git, you might support only HTTP(S) access, but you
       may also want to have support for "git://..." URI schemes.

     o Configure your mirror server to do the following:

       - Periodically call `bst mirror` for the latest version of your
         project.

       - For more robust mirroring, you may want to go so far as to
         have a mirror session "triggered" by a commit to the git repo
         which is hosting your BuildStream project. This is just to
         ensure that you *never* miss a beat.
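
   The periodic step could be driven by something like the sketch
   below. It rests on assumptions: `bst mirror` is the command proposed
   in this email and does not exist yet, the project is assumed to live
   in a git checkout on the mirror server, and `mirror_latest()` is a
   hypothetical helper name.

```python
import subprocess


def mirror_latest(project_dir, runner=subprocess.run):
    """Update the project checkout, then mirror all of its sources.

    Returns True only if both steps exit with status 0.
    """
    # Assumption: the project lives in a git checkout on the mirror server.
    update = runner(["git", "pull", "--ff-only"], cwd=project_dir)
    if update.returncode != 0:
        return False
    # `bst mirror` is the proposed command; it takes no TARGET argument.
    mirror = runner(["bst", "mirror"], cwd=project_dir)
    return mirror.returncode == 0
```

   The same function could be called from a timer for the periodic
   case, or from a hook fired by a commit to the project repository.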


Properties of the counter proposal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My counter proposal, while being a bit of code that needs writing, is
less complex: it does not imply the "Unnecessary Complexity" drawbacks
which I have highlighted above.

The code involved in this solution perhaps involves some boilerplate,
but is actually *easier to write* and very straightforward.

By forcing a mirror to be a single location, there are fewer points
of failure, and the one point of failure is under the control of the
people who maintain the said BuildStream project.

What this proposal does *not* do, however, is add any possibility for
the bandaids described as (A) above. It does provide a practical
solution for (B); users who want (A) should be satisfied by (B) as
well, but the opposite is not entirely true.


I would very much like to hear feedback on this, particularly I would
like to know if I've missed something about the (A) approach which is
absolutely needed even in the presence of a (B) solution, and/or if it
is more desirable/necessary to have sessions try multiple URLs for the
same source in the same session - or, anything else I may have missed.


Regards,
    -Tristan

Follow-Ups:
- Re: Discussion on source mirroring (with counter proposal)
  - From: Paul Sherwood
- Re: Discussion on source mirroring (with counter proposal)
  - From: Agustín Benito Bethencourt
- Re: Discussion on source mirroring (with counter proposal)
  - From: Sander Striker

References:
- Discussion on source mirroring
  - From: Jonathan Maw
- Re: Discussion on source mirroring
  - From: Tristan Van Berkom
