Martin Blanchard pushed to branch mablanch/139-emit-build-metrics at BuildGrid / buildgrid
Commits:
- 859c0fa8 by Finn at 2018-11-27T15:25:25Z
- e1091b04 by Finn at 2018-11-27T15:25:25Z
- 1dc5d2d2 by Finn at 2018-11-27T15:25:25Z
- 35c901bd by Finn at 2018-11-27T16:23:49Z
- 5cb38e2a by Finn at 2018-11-27T16:23:52Z
- 32ad653d by Finn at 2018-11-27T16:23:52Z
- e15a9c91 by Finn at 2018-11-27T16:23:52Z
- bd5587ea by Finn at 2018-11-27T16:23:52Z
- d94fa258 by Finn at 2018-11-27T16:23:52Z
- 6f90a553 by Finn at 2018-11-27T16:23:52Z
- db65c5ec by Raoul Hidalgo Charman at 2018-11-28T12:23:41Z
- db53ffbc by Martin Blanchard at 2018-11-29T08:59:48Z
- 5ecfb7f8 by Martin Blanchard at 2018-11-29T08:59:48Z
- df5b6a80 by Martin Blanchard at 2018-11-29T08:59:48Z
- 5e608d6b by Martin Blanchard at 2018-11-29T08:59:48Z
- c167a1d0 by Martin Blanchard at 2018-11-29T08:59:48Z
- 8fc6d17d by Martin Blanchard at 2018-11-29T08:59:48Z
- 397f385b by Martin Blanchard at 2018-11-29T08:59:48Z
- dbbcdb50 by Martin Blanchard at 2018-11-29T08:59:48Z
- 50f3f63b by Martin Blanchard at 2018-11-29T08:59:48Z
- 100e91b9 by Arber Xhindoli at 2018-11-30T16:40:19Z
- 49586d88 by Martin Blanchard at 2018-11-30T17:19:40Z
- 5673b009 by Martin Blanchard at 2018-11-30T17:20:19Z
- b763ec5f by Martin Blanchard at 2018-11-30T17:20:19Z
- 76459e0a by Martin Blanchard at 2018-11-30T17:20:19Z
- 94bf76a7 by Martin Blanchard at 2018-11-30T17:20:19Z
- 8f5d71c5 by Martin Blanchard at 2018-11-30T17:20:19Z
- 4db4af8f by Martin Blanchard at 2018-12-04T14:23:58Z
- 2bcc169d by Martin Blanchard at 2018-12-04T14:23:58Z
- 442acf9b by Martin Blanchard at 2018-12-04T14:23:58Z
29 changed files:
- .gitlab-ci.yml
- .pylintrc
- buildgrid/_app/cli.py
- buildgrid/_app/commands/cmd_bot.py
- + buildgrid/_app/commands/cmd_capabilities.py
- buildgrid/_app/commands/cmd_server.py
- + buildgrid/client/capabilities.py
- buildgrid/client/cas.py
- buildgrid/server/_monitoring.py
- buildgrid/server/bots/instance.py
- buildgrid/server/bots/service.py
- + buildgrid/server/capabilities/__init__.py
- + buildgrid/server/capabilities/instance.py
- + buildgrid/server/capabilities/service.py
- buildgrid/server/cas/instance.py
- buildgrid/server/cas/service.py
- buildgrid/server/execution/instance.py
- buildgrid/server/execution/service.py
- buildgrid/server/instance.py
- buildgrid/server/job.py
- buildgrid/server/operations/instance.py
- buildgrid/server/operations/service.py
- buildgrid/server/scheduler.py
- buildgrid/settings.py
- buildgrid/utils.py
- setup.py
- tests/cas/test_storage.py
- + tests/integration/capabilities_service.py
- + tests/utils/capabilities.py
Changes:
@@ -2,7 +2,7 @@
 image: python:3.5-stretch
 
 variables:
-  BGD: bgd --verbose
+  BGD: bgd
 
 stages:
   - test
@@ -185,6 +185,7 @@ ignore-on-opaque-inference=yes
 # for classes with dynamically set attributes). This supports the use of
 # qualified names.
 ignored-classes=google.protobuf.any_pb2.Any,
+                google.protobuf.duration_pb2.Duration,
                 google.protobuf.timestamp_pb2.Timestamp
 
 # List of module names for which member attributes should not be checked
@@ -460,6 +461,7 @@ known-third-party=boto3,
                   enchant,
                   google,
                   grpc,
+                  janus,
                   moto,
                   yaml
 
@@ -23,10 +23,12 @@ will be attempted to be imported.
 
 import logging
 import os
+import sys
 
 import click
 import grpc
 
+from buildgrid.settings import LOG_RECORD_FORMAT
 from buildgrid.utils import read_file
 
 CONTEXT_SETTINGS = dict(auto_envvar_prefix='BUILDGRID')
@@ -138,28 +140,71 @@ class BuildGridCLI(click.MultiCommand):
         return mod.cli
 
 
+class DebugFilter(logging.Filter):
+
+    def __init__(self, debug_domains, name=''):
+        super().__init__(name=name)
+        self.__domains_tree = {}
+
+        for domain in debug_domains.split(':'):
+            domains_tree = self.__domains_tree
+            for label in domain.split('.'):
+                if all(key not in domains_tree for key in [label, '*']):
+                    domains_tree[label] = {}
+                domains_tree = domains_tree[label]
+
+    def filter(self, record):
+        domains_tree, last_match = self.__domains_tree, None
+        for label in record.name.split('.'):
+            if all(key not in domains_tree for key in [label, '*']):
+                return False
+            last_match = label if label in domains_tree else '*'
+            domains_tree = domains_tree[last_match]
+        if domains_tree and '*' not in domains_tree:
+            return False
+        return True
+
+
+def setup_logging(verbosity=0, debug_mode=False):
+    """Deals with loggers' verbosity."""
+    asyncio_logger = logging.getLogger('asyncio')
+    root_logger = logging.getLogger()
+
+    log_handler = logging.StreamHandler(stream=sys.stdout)
+    for log_filter in root_logger.filters:
+        log_handler.addFilter(log_filter)
+
+    logging.basicConfig(format=LOG_RECORD_FORMAT, handlers=[log_handler])
+
+    if verbosity == 1:
+        root_logger.setLevel(logging.WARNING)
+    elif verbosity == 2:
+        root_logger.setLevel(logging.INFO)
+    elif verbosity >= 3:
+        root_logger.setLevel(logging.DEBUG)
+    else:
+        root_logger.setLevel(logging.ERROR)
+
+    if not debug_mode:
+        asyncio_logger.setLevel(logging.CRITICAL)
+    else:
+        asyncio_logger.setLevel(logging.DEBUG)
+        root_logger.setLevel(logging.DEBUG)
+
+
 @click.command(cls=BuildGridCLI, context_settings=CONTEXT_SETTINGS)
-@click.option('-v', '--verbose', count=True,
-              help='Increase log verbosity level.')
 @pass_context
-def cli(context, verbose):
+def cli(context):
     """BuildGrid App"""
-    logger = logging.getLogger()
+    root_logger = logging.getLogger()
 
     # Clean-up root logger for any pre-configuration:
-    for log_handler in logger.handlers[:]:
-        logger.removeHandler(log_handler)
-    for log_filter in logger.filters[:]:
-        logger.removeFilter(log_filter)
-
-    logging.basicConfig(
-        format='%(asctime)s:[%(name)32.32s][%(levelname)5.5s]: %(message)s')
-
-    if verbose == 1:
-        logger.setLevel(logging.WARNING)
-    elif verbose == 2:
-        logger.setLevel(logging.INFO)
-    elif verbose >= 3:
-        logger.setLevel(logging.DEBUG)
-    else:
-        logger.setLevel(logging.ERROR)
+    for log_handler in root_logger.handlers[:]:
+        root_logger.removeHandler(log_handler)
+    for log_filter in root_logger.filters[:]:
+        root_logger.removeFilter(log_filter)
+
+    # Filter debug messages using BGD_MESSAGE_DEBUG value:
+    debug_domains = os.environ.get('BGD_MESSAGE_DEBUG', None)
+    if debug_domains:
+        root_logger.addFilter(DebugFilter(debug_domains))
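For reference, BGD_MESSAGE_DEBUG holds colon-separated logger domains, where a '*' label matches any single name segment. A minimal sketch of the new filter's behaviour (logger names and messages below are illustrative, not taken from the change itself):

    import logging

    from buildgrid._app.cli import DebugFilter

    # Keep records from direct children of 'buildgrid.server', plus the
    # exact 'buildgrid.client.cas' logger:
    log_filter = DebugFilter('buildgrid.server.*:buildgrid.client.cas')

    record = logging.LogRecord('buildgrid.server.job', logging.DEBUG,
                               __file__, 0, 'state changed', None, None)
    assert log_filter.filter(record)      # matched by 'buildgrid.server.*'

    record = logging.LogRecord('buildgrid.client.capabilities', logging.DEBUG,
                               __file__, 0, 'query sent', None, None)
    assert not log_filter.filter(record)  # no matching debug domain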
@@ -34,7 +34,7 @@ from buildgrid.bot.hardware.worker import Worker
 
 
 from ..bots import buildbox, dummy, host
-from ..cli import pass_context
+from ..cli import pass_context, setup_logging
 
 
 @click.group(name='bot', short_help="Create and register bot clients.")
@@ -58,9 +58,12 @@ from ..cli import pass_context
               help="Time period for bot updates to the server in seconds.")
 @click.option('--parent', type=click.STRING, default='main', show_default=True,
               help="Targeted farm resource.")
+@click.option('-v', '--verbose', count=True,
+              help='Increase log verbosity level.')
 @pass_context
 def cli(context, parent, update_period, remote, client_key, client_cert, server_cert,
-        remote_cas, cas_client_key, cas_client_cert, cas_server_cert):
+        remote_cas, cas_client_key, cas_client_cert, cas_server_cert, verbose):
+    setup_logging(verbosity=verbose)
     # Setup the remote execution server channel:
     url = urlparse(remote)
 
@@ -122,9 +125,8 @@ def cli(context, parent, update_period, remote, client_key, client_cert, server_cert,
     context.cas_client_cert = context.client_cert
     context.cas_server_cert = context.server_cert
 
-    click.echo("Starting for remote=[{}]".format(context.remote))
-
     bot_interface = interface.BotInterface(context.channel)
+
     worker = Worker()
     worker.add_device(Device())
     hardware_interface = HardwareInterface(worker)
+# Copyright (C) 2018 Bloomberg LP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  <http://www.apache.org/licenses/LICENSE-2.0>
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import sys
+from urllib.parse import urlparse
+
+import click
+import grpc
+
+from buildgrid.client.capabilities import CapabilitiesInterface
+
+from ..cli import pass_context
+
+
+@click.command(name='capabilities', short_help="Capabilities service.")
+@click.option('--remote', type=click.STRING, default='http://localhost:50051', show_default=True,
+              help="Remote execution server's URL (port defaults to 50051 if not specified).")
+@click.option('--client-key', type=click.Path(exists=True, dir_okay=False), default=None,
+              help="Private client key for TLS (PEM-encoded)")
+@click.option('--client-cert', type=click.Path(exists=True, dir_okay=False), default=None,
+              help="Public client certificate for TLS (PEM-encoded)")
+@click.option('--server-cert', type=click.Path(exists=True, dir_okay=False), default=None,
+              help="Public server certificate for TLS (PEM-encoded)")
+@click.option('--instance-name', type=click.STRING, default='main', show_default=True,
+              help="Targeted farm instance name.")
+@pass_context
+def cli(context, remote, instance_name, client_key, client_cert, server_cert):
+    click.echo("Getting capabilities...")
+    url = urlparse(remote)
+
+    remote = '{}:{}'.format(url.hostname, url.port or 50051)
+
+    if url.scheme == 'http':
+        channel = grpc.insecure_channel(remote)
+    else:
+        credentials = context.load_client_credentials(client_key, client_cert, server_cert)
+        if not credentials:
+            click.echo("ERROR: no TLS keys were specified and no defaults could be found.", err=True)
+            sys.exit(-1)
+
+        channel = grpc.secure_channel(remote, credentials)
+
+    interface = CapabilitiesInterface(channel)
+    response = interface.get_capabilities(instance_name)
+    click.echo(response)
@@ -26,7 +26,7 @@ import click
 
 from buildgrid.server.instance import BuildGridServer
 
-from ..cli import pass_context
+from ..cli import pass_context, setup_logging
 from ..settings import parser
 
 
@@ -37,9 +37,14 @@ def cli(context):
 
 
 @cli.command('start', short_help="Setup a new server instance.")
-@click.argument('CONFIG', type=click.Path(file_okay=True, dir_okay=False, writable=False))
+@click.argument('CONFIG',
+                type=click.Path(file_okay=True, dir_okay=False, writable=False))
+@click.option('-v', '--verbose', count=True,
+              help='Increase log verbosity level.')
 @pass_context
-def start(context, config):
+def start(context, config, verbose):
+    setup_logging(verbosity=verbose)
+
     with open(config) as f:
         settings = parser.get_parser().safe_load(f)
 
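Note that the '-v' flag now lives on the subcommands rather than on the root 'bgd' group, so verbosity is requested per invocation, for instance (the config path is a placeholder):

    bgd server start server.conf -vv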
+# Copyright (C) 2018 Bloomberg LP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  <http://www.apache.org/licenses/LICENSE-2.0>
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import logging
+import grpc
+
+from buildgrid._protos.build.bazel.remote.execution.v2 import remote_execution_pb2, remote_execution_pb2_grpc
+
+
+class CapabilitiesInterface:
+    """Interface for calls to the Capabilities service."""
+
+    def __init__(self, channel):
+        """Initialises an instance of the capabilities service.
+
+        Args:
+            channel (grpc.Channel): A gRPC channel to the CAS endpoint.
+        """
+        self.__logger = logging.getLogger(__name__)
+        self.__stub = remote_execution_pb2_grpc.CapabilitiesStub(channel)
+
+    def get_capabilities(self, instance_name):
+        """Returns the capabilities of the server to the user.
+
+        Args:
+            instance_name (str): The name of the instance.
+        """
+        request = remote_execution_pb2.GetCapabilitiesRequest(instance_name=instance_name)
+        try:
+            return self.__stub.GetCapabilities(request)
+
+        except grpc.RpcError as e:
+            self.__logger.error(e)
+            raise
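A short sketch of the new client interface in use, assuming a plain-text BuildGrid endpoint is reachable (address and instance name are placeholders):

    import grpc

    from buildgrid.client.capabilities import CapabilitiesInterface

    # Query the server's advertised capabilities over an insecure channel:
    channel = grpc.insecure_channel('localhost:50051')
    interface = CapabilitiesInterface(channel)

    response = interface.get_capabilities('main')
    print(response.execution_capabilities.exec_enabled)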
@@ -23,19 +23,13 @@ from buildgrid._exceptions import NotFoundError
 from buildgrid._protos.build.bazel.remote.execution.v2 import remote_execution_pb2, remote_execution_pb2_grpc
 from buildgrid._protos.google.bytestream import bytestream_pb2, bytestream_pb2_grpc
 from buildgrid._protos.google.rpc import code_pb2
-from buildgrid.settings import HASH
+from buildgrid.settings import HASH, MAX_REQUEST_SIZE, MAX_REQUEST_COUNT
 from buildgrid.utils import merkle_tree_maker
 
 
 # Maximum size for a queueable file:
 FILE_SIZE_THRESHOLD = 1 * 1024 * 1024
 
-# Maximum size for a single gRPC request:
-MAX_REQUEST_SIZE = 2 * 1024 * 1024
-
-# Maximum number of elements per gRPC request:
-MAX_REQUEST_COUNT = 500
-
 
 class _CallCache:
     """Per remote grpc.StatusCode.UNIMPLEMENTED call cache."""
@@ -390,11 +384,10 @@ class Downloader:
             assert digest.hash in directories
 
             directory = directories[digest.hash]
-            self._write_directory(digest.hash, directory_path,
+            self._write_directory(directory, directory_path,
                                   directories=directories, root_barrier=directory_path)
 
             directory_fetched = True
-
         except grpc.RpcError as e:
             status_code = e.code()
             if status_code == grpc.StatusCode.UNIMPLEMENTED:
@@ -156,9 +156,11 @@ class MonitoringBus:
                 output_writers.append(output_file)
 
                 while True:
-                    if await __streaming_worker(iter(output_file)):
+                    if await __streaming_worker([output_file]):
                         self.__sequence_number += 1
 
+                        output_file.flush()
+
         else:
             output_writers.append(sys.stdout.buffer)
 
37 | 37 |
self._assigned_leases = {}
|
38 | 38 |
self._scheduler = scheduler
|
39 | 39 |
|
40 |
+ @property
|
|
41 |
+ def scheduler(self):
|
|
42 |
+ return self._scheduler
|
|
43 |
+ |
|
40 | 44 |
def register_instance_with_server(self, instance_name, server):
|
41 | 45 |
server.add_bots_interface(self, instance_name)
|
42 | 46 |
|
... | ... | @@ -23,8 +23,9 @@ import logging |
23 | 23 |
|
24 | 24 |
import grpc
|
25 | 25 |
|
26 |
-from google.protobuf.empty_pb2 import Empty
|
|
26 |
+from google.protobuf import empty_pb2, timestamp_pb2
|
|
27 | 27 |
|
28 |
+from buildgrid._enums import BotStatus
|
|
28 | 29 |
from buildgrid._exceptions import InvalidArgumentError, OutOfSyncError
|
29 | 30 |
from buildgrid._protos.google.devtools.remoteworkers.v1test2 import bots_pb2
|
30 | 31 |
from buildgrid._protos.google.devtools.remoteworkers.v1test2 import bots_pb2_grpc
|
... | ... | @@ -32,24 +33,86 @@ from buildgrid._protos.google.devtools.remoteworkers.v1test2 import bots_pb2_grp |
32 | 33 |
|
33 | 34 |
class BotsService(bots_pb2_grpc.BotsServicer):
|
34 | 35 |
|
35 |
- def __init__(self, server):
|
|
36 |
+ def __init__(self, server, monitor=False):
|
|
36 | 37 |
self.__logger = logging.getLogger(__name__)
|
37 | 38 |
|
39 |
+ self.__bots_by_status = None
|
|
40 |
+ self.__bots_by_instance = None
|
|
41 |
+ self.__bots = None
|
|
42 |
+ |
|
38 | 43 |
self._instances = {}
|
39 | 44 |
|
40 | 45 |
bots_pb2_grpc.add_BotsServicer_to_server(self, server)
|
41 | 46 |
|
42 |
- def add_instance(self, name, instance):
|
|
43 |
- self._instances[name] = instance
|
|
47 |
+ self._is_instrumented = monitor
|
|
48 |
+ |
|
49 |
+ if self._is_instrumented:
|
|
50 |
+ self.__bots_by_status = {}
|
|
51 |
+ self.__bots_by_instance = {}
|
|
52 |
+ self.__bots = {}
|
|
53 |
+ |
|
54 |
+ self.__bots_by_status[BotStatus.OK] = set()
|
|
55 |
+ self.__bots_by_status[BotStatus.UNHEALTHY] = set()
|
|
56 |
+ |
|
57 |
+ # --- Public API ---
|
|
58 |
+ |
|
59 |
+ def add_instance(self, instance_name, instance):
|
|
60 |
+ """Registers a new servicer instance.
|
|
61 |
+ |
|
62 |
+ Args:
|
|
63 |
+ instance_name (str): The new instance's name.
|
|
64 |
+ instance (BotsInterface): The new instance itself.
|
|
65 |
+ """
|
|
66 |
+ self._instances[instance_name] = instance
|
|
67 |
+ |
|
68 |
+ if self._is_instrumented:
|
|
69 |
+ self.__bots_by_instance[instance_name] = set()
|
|
70 |
+ |
|
71 |
+ def get_scheduler(self, instance_name):
|
|
72 |
+ """Retrieves a reference to the scheduler for an instance.
|
|
73 |
+ |
|
74 |
+ Args:
|
|
75 |
+ instance_name (str): The name of the instance to query.
|
|
76 |
+ |
|
77 |
+ Returns:
|
|
78 |
+ Scheduler: A reference to the scheduler for `instance_name`.
|
|
79 |
+ |
|
80 |
+ Raises:
|
|
81 |
+ InvalidArgumentError: If no instance named `instance_name` exists.
|
|
82 |
+ """
|
|
83 |
+ instance = self._get_instance(instance_name)
|
|
84 |
+ |
|
85 |
+ return instance.scheduler
|
|
86 |
+ |
|
87 |
+ # --- Public API: Servicer ---
|
|
44 | 88 |
|
45 | 89 |
def CreateBotSession(self, request, context):
|
90 |
+ """Handles CreateBotSessionRequest messages.
|
|
91 |
+ |
|
92 |
+ Args:
|
|
93 |
+ request (CreateBotSessionRequest): The incoming RPC request.
|
|
94 |
+ context (grpc.ServicerContext): Context for the RPC call.
|
|
95 |
+ """
|
|
46 | 96 |
self.__logger.debug("CreateBotSession request from [%s]", context.peer())
|
47 | 97 |
|
98 |
+ instance_name = request.parent
|
|
99 |
+ bot_status = BotStatus(request.bot_session.status)
|
|
100 |
+ bot_id = request.bot_session.bot_id
|
|
101 |
+ |
|
48 | 102 |
try:
|
49 |
- parent = request.parent
|
|
50 |
- instance = self._get_instance(request.parent)
|
|
51 |
- return instance.create_bot_session(parent,
|
|
52 |
- request.bot_session)
|
|
103 |
+ instance = self._get_instance(instance_name)
|
|
104 |
+ bot_session = instance.create_bot_session(instance_name,
|
|
105 |
+ request.bot_session)
|
|
106 |
+ now = timestamp_pb2.Timestamp()
|
|
107 |
+ now.GetCurrentTime()
|
|
108 |
+ |
|
109 |
+ if self._is_instrumented:
|
|
110 |
+ self.__bots[bot_id] = now
|
|
111 |
+ self.__bots_by_instance[instance_name].add(bot_id)
|
|
112 |
+ if bot_status in self.__bots_by_status:
|
|
113 |
+ self.__bots_by_status[bot_status].add(bot_id)
|
|
114 |
+ |
|
115 |
+ return bot_session
|
|
53 | 116 |
|
54 | 117 |
except InvalidArgumentError as e:
|
55 | 118 |
self.__logger.error(e)
|
@@ -59,17 +122,41 @@ class BotsService(bots_pb2_grpc.BotsServicer):
             return bots_pb2.BotSession()
 
     def UpdateBotSession(self, request, context):
+        """Handles UpdateBotSessionRequest messages.
+
+        Args:
+            request (UpdateBotSessionRequest): The incoming RPC request.
+            context (grpc.ServicerContext): Context for the RPC call.
+        """
         self.__logger.debug("UpdateBotSession request from [%s]", context.peer())
 
+        names = request.name.split("/")
+        bot_status = BotStatus(request.bot_session.status)
+        bot_id = request.bot_session.bot_id
+
         try:
-            names = request.name.split("/")
-            # Operation name should be in format:
-            # {instance/name}/{uuid}
-            instance_name = ''.join(names[0:-1])
+            instance_name = '/'.join(names[:-1])
 
             instance = self._get_instance(instance_name)
-            return instance.update_bot_session(request.name,
-                                               request.bot_session)
+            bot_session = instance.update_bot_session(request.name,
+                                                      request.bot_session)
+
+            if self._is_instrumented:
+                self.__bots[bot_id].GetCurrentTime()
+                if bot_id not in self.__bots_by_status[bot_status]:
+                    if bot_status == BotStatus.OK:
+                        self.__bots_by_status[BotStatus.OK].add(bot_id)
+                        self.__bots_by_status[BotStatus.UNHEALTHY].discard(bot_id)
+
+                    elif bot_status == BotStatus.UNHEALTHY:
+                        self.__bots_by_status[BotStatus.OK].discard(bot_id)
+                        self.__bots_by_status[BotStatus.UNHEALTHY].add(bot_id)
+
+                    else:
+                        self.__bots_by_instance[instance_name].remove(bot_id)
+                        del self.__bots[bot_id]
+
+            return bot_session
 
         except InvalidArgumentError as e:
             self.__logger.error(e)
@@ -89,10 +176,47 @@ class BotsService(bots_pb2_grpc.BotsServicer):
             return bots_pb2.BotSession()
 
     def PostBotEventTemp(self, request, context):
+        """Handles PostBotEventTempRequest messages.
+
+        Args:
+            request (PostBotEventTempRequest): The incoming RPC request.
+            context (grpc.ServicerContext): Context for the RPC call.
+        """
         self.__logger.debug("PostBotEventTemp request from [%s]", context.peer())
 
         context.set_code(grpc.StatusCode.UNIMPLEMENTED)
-        return Empty()
+
+        return empty_pb2.Empty()
+
+    # --- Public API: Monitoring ---
+
+    @property
+    def is_instrumented(self):
+        return self._is_instrumented
+
+    def query_n_bots(self):
+        if self.__bots is not None:
+            return len(self.__bots)
+
+        return 0
+
+    def query_n_bots_for_instance(self, instance_name):
+        try:
+            if self.__bots_by_instance is not None:
+                return len(self.__bots_by_instance[instance_name])
+        except KeyError:
+            pass
+        return 0
+
+    def query_n_bots_for_status(self, bot_status):
+        try:
+            if self.__bots_by_status is not None:
+                return len(self.__bots_by_status[bot_status])
+        except KeyError:
+            pass
+        return 0
+
+    # --- Private API ---
 
     def _get_instance(self, name):
         try:
+# Copyright (C) 2018 Bloomberg LP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  <http://www.apache.org/licenses/LICENSE-2.0>
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import logging
+
+from buildgrid._protos.build.bazel.remote.execution.v2 import remote_execution_pb2
+
+
+class CapabilitiesInstance:
+
+    def __init__(self, cas_instance=None, action_cache_instance=None, execution_instance=None):
+        self.__logger = logging.getLogger(__name__)
+        self.__cas_instance = cas_instance
+        self.__action_cache_instance = action_cache_instance
+        self.__execution_instance = execution_instance
+
+    def register_instance_with_server(self, instance_name, server):
+        server.add_capabilities_instance(self, instance_name)
+
+    def add_cas_instance(self, cas_instance):
+        self.__cas_instance = cas_instance
+
+    def add_action_cache_instance(self, action_cache_instance):
+        self.__action_cache_instance = action_cache_instance
+
+    def add_execution_instance(self, execution_instance):
+        self.__execution_instance = execution_instance
+
+    def get_capabilities(self):
+        server_capabilities = remote_execution_pb2.ServerCapabilities()
+        server_capabilities.cache_capabilities.CopyFrom(self._get_cache_capabilities())
+        server_capabilities.execution_capabilities.CopyFrom(self._get_capabilities_execution())
+        # TODO: When the API is stable, fill out the SemVer values:
+        # server_capabilities.deprecated_api_version =
+        # server_capabilities.low_api_version =
+        # server_capabilities.high_api_version =
+        return server_capabilities
+
+    def _get_cache_capabilities(self):
+        capabilities = remote_execution_pb2.CacheCapabilities()
+        action_cache_update_capabilities = remote_execution_pb2.ActionCacheUpdateCapabilities()
+
+        if self.__cas_instance:
+            capabilities.digest_function.extend([self.__cas_instance.hash_type()])
+            capabilities.max_batch_total_size_bytes = self.__cas_instance.max_batch_total_size_bytes()
+            capabilities.symlink_absolute_path_strategy = self.__cas_instance.symlink_absolute_path_strategy()
+            # TODO: cache priority #102
+            # capabilities.cache_priority_capabilities =
+
+        if self.__action_cache_instance:
+            action_cache_update_capabilities.update_enabled = self.__action_cache_instance.allow_updates
+
+        capabilities.action_cache_update_capabilities.CopyFrom(action_cache_update_capabilities)
+        return capabilities
+
+    def _get_capabilities_execution(self):
+        capabilities = remote_execution_pb2.ExecutionCapabilities()
+        if self.__execution_instance:
+            capabilities.exec_enabled = True
+            capabilities.digest_function = self.__execution_instance.hash_type()
+            # TODO: execution priority #102
+            # capabilities.execution_priority =
+
+        else:
+            capabilities.exec_enabled = False
+
+        return capabilities
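A sketch of the instance in isolation, assuming a CAS instance built on some storage backend (the storage object below is a placeholder, not part of this change):

    from buildgrid.server.capabilities.instance import CapabilitiesInstance
    from buildgrid.server.cas.instance import ContentAddressableStorageInstance

    cas = ContentAddressableStorageInstance(storage)  # storage: any CAS backend
    capabilities = CapabilitiesInstance(cas_instance=cas)

    # Returns a remote_execution_pb2.ServerCapabilities message:
    server_capabilities = capabilities.get_capabilities()
    print(server_capabilities.cache_capabilities.max_batch_total_size_bytes)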
+# Copyright (C) 2018 Bloomberg LP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  <http://www.apache.org/licenses/LICENSE-2.0>
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import logging
+
+import grpc
+
+from buildgrid._exceptions import InvalidArgumentError
+from buildgrid._protos.build.bazel.remote.execution.v2 import remote_execution_pb2, remote_execution_pb2_grpc
+
+
+class CapabilitiesService(remote_execution_pb2_grpc.CapabilitiesServicer):
+
+    def __init__(self, server):
+        self.__logger = logging.getLogger(__name__)
+        self.__instances = {}
+        remote_execution_pb2_grpc.add_CapabilitiesServicer_to_server(self, server)
+
+    def add_instance(self, name, instance):
+        self.__instances[name] = instance
+
+    def add_cas_instance(self, name, instance):
+        self.__instances[name].add_cas_instance(instance)
+
+    def add_action_cache_instance(self, name, instance):
+        self.__instances[name].add_action_cache_instance(instance)
+
+    def add_execution_instance(self, name, instance):
+        self.__instances[name].add_execution_instance(instance)
+
+    def GetCapabilities(self, request, context):
+        try:
+            instance = self._get_instance(request.instance_name)
+            return instance.get_capabilities()
+
+        except InvalidArgumentError as e:
+            self.__logger.error(e)
+            context.set_details(str(e))
+            context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
+
+        return remote_execution_pb2.ServerCapabilities()
+
+    def _get_instance(self, name):
+        try:
+            return self.__instances[name]
+
+        except KeyError:
+            raise InvalidArgumentError("Instance doesn't exist on server: [{}]".format(name))
@@ -24,7 +24,8 @@ import logging
 from buildgrid._exceptions import InvalidArgumentError, NotFoundError, OutOfRangeError
 from buildgrid._protos.google.bytestream import bytestream_pb2
 from buildgrid._protos.build.bazel.remote.execution.v2 import remote_execution_pb2 as re_pb2
-from buildgrid.settings import HASH, HASH_LENGTH
+from buildgrid.settings import HASH, HASH_LENGTH, MAX_REQUEST_SIZE, MAX_REQUEST_COUNT
+from buildgrid.utils import get_hash_type
 
 
 class ContentAddressableStorageInstance:
37 | 38 |
def register_instance_with_server(self, instance_name, server):
|
38 | 39 |
server.add_cas_instance(self, instance_name)
|
39 | 40 |
|
41 |
+ def hash_type(self):
|
|
42 |
+ return get_hash_type()
|
|
43 |
+ |
|
44 |
+ def max_batch_total_size_bytes(self):
|
|
45 |
+ return MAX_REQUEST_SIZE
|
|
46 |
+ |
|
47 |
+ def symlink_absolute_path_strategy(self):
|
|
48 |
+ # Currently this strategy is hardcoded into BuildGrid
|
|
49 |
+ # With no setting to reference
|
|
50 |
+ return re_pb2.CacheCapabilities().DISALLOWED
|
|
51 |
+ |
|
40 | 52 |
def find_missing_blobs(self, blob_digests):
|
41 | 53 |
storage = self._storage
|
42 | 54 |
return re_pb2.FindMissingBlobsResponse(
|
@@ -58,6 +70,41 @@ class ContentAddressableStorageInstance:
 
         return response
 
+    def get_tree(self, request):
+        storage = self._storage
+
+        response = re_pb2.GetTreeResponse()
+        page_size = request.page_size
+
+        if not request.page_size:
+            request.page_size = MAX_REQUEST_COUNT
+
+        root_digest = request.root_digest
+        page_size = request.page_size
+
+        def __get_tree(node_digest):
+            nonlocal response, page_size, request
+
+            if not page_size:
+                page_size = request.page_size
+                yield response
+                response = re_pb2.GetTreeResponse()
+
+            if response.ByteSize() >= (MAX_REQUEST_SIZE):
+                yield response
+                response = re_pb2.GetTreeResponse()
+
+            directory_from_digest = storage.get_message(node_digest, re_pb2.Directory)
+            page_size -= 1
+            response.directories.extend([directory_from_digest])
+
+            for directory in directory_from_digest.directories:
+                yield from __get_tree(directory.digest)
+
+            yield response
+
+        return __get_tree(root_digest)
+
 
 class ByteStreamInstance:
 
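Since get_tree() hands back a generator of partial GetTreeResponse messages, cut on page_size and MAX_REQUEST_SIZE boundaries, callers drain it to collect the whole tree; an illustrative consumption, assuming cas_instance and a GetTreeRequest named request are in scope:

    directories = []
    for response in cas_instance.get_tree(request):
        directories.extend(response.directories)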
@@ -86,10 +86,16 @@ class ContentAddressableStorageService(remote_execution_pb2_grpc.ContentAddressableStorageServicer):
     def GetTree(self, request, context):
         self.__logger.debug("GetTree request from [%s]", context.peer())
 
-        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
-        context.set_details('Method not implemented!')
+        try:
+            instance = self._get_instance(request.instance_name)
+            yield from instance.get_tree(request)
+
+        except InvalidArgumentError as e:
+            self.__logger.error(e)
+            context.set_details(str(e))
+            context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
 
-        return iter([remote_execution_pb2.GetTreeResponse()])
+            yield remote_execution_pb2.GetTreeResponse()
 
     def _get_instance(self, instance_name):
         try:
@@ -25,6 +25,7 @@ from buildgrid._exceptions import FailedPreconditionError, InvalidArgumentError
 from buildgrid._protos.build.bazel.remote.execution.v2.remote_execution_pb2 import Action
 
 from ..job import Job
+from ...utils import get_hash_type
 
 
 class ExecutionInstance:
|
... | ... | @@ -35,9 +36,16 @@ class ExecutionInstance: |
35 | 36 |
self._storage = storage
|
36 | 37 |
self._scheduler = scheduler
|
37 | 38 |
|
39 |
+ @property
|
|
40 |
+ def scheduler(self):
|
|
41 |
+ return self._scheduler
|
|
42 |
+ |
|
38 | 43 |
def register_instance_with_server(self, instance_name, server):
|
39 | 44 |
server.add_execution_instance(self, instance_name)
|
40 | 45 |
|
46 |
+ def hash_type(self):
|
|
47 |
+ return get_hash_type()
|
|
48 |
+ |
|
41 | 49 |
def execute(self, action_digest, skip_cache_lookup, message_queue=None):
|
42 | 50 |
""" Sends a job for execution.
|
43 | 51 |
Queues an action and creates an Operation instance to be associated with
|
... | ... | @@ -33,30 +33,84 @@ from buildgrid._protos.google.longrunning import operations_pb2 |
33 | 33 |
|
34 | 34 |
class ExecutionService(remote_execution_pb2_grpc.ExecutionServicer):
|
35 | 35 |
|
36 |
- def __init__(self, server):
|
|
36 |
+ def __init__(self, server, monitor=False):
|
|
37 | 37 |
self.__logger = logging.getLogger(__name__)
|
38 | 38 |
|
39 |
+ self.__peers_by_instance = None
|
|
40 |
+ self.__peers = None
|
|
41 |
+ |
|
39 | 42 |
self._instances = {}
|
43 |
+ |
|
40 | 44 |
remote_execution_pb2_grpc.add_ExecutionServicer_to_server(self, server)
|
41 | 45 |
|
42 |
- def add_instance(self, name, instance):
|
|
43 |
- self._instances[name] = instance
|
|
46 |
+ self._is_instrumented = monitor
|
|
47 |
+ |
|
48 |
+ if self._is_instrumented:
|
|
49 |
+ self.__peers_by_instance = {}
|
|
50 |
+ self.__peers = {}
|
|
51 |
+ |
|
52 |
+ # --- Public API ---
|
|
53 |
+ |
|
54 |
+ def add_instance(self, instance_name, instance):
|
|
55 |
+ """Registers a new servicer instance.
|
|
56 |
+ |
|
57 |
+ Args:
|
|
58 |
+ instance_name (str): The new instance's name.
|
|
59 |
+ instance (ExecutionInstance): The new instance itself.
|
|
60 |
+ """
|
|
61 |
+ self._instances[instance_name] = instance
|
|
62 |
+ |
|
63 |
+ if self._is_instrumented:
|
|
64 |
+ self.__peers_by_instance[instance_name] = set()
|
|
65 |
+ |
|
66 |
+ def get_scheduler(self, instance_name):
|
|
67 |
+ """Retrieves a reference to the scheduler for an instance.
|
|
68 |
+ |
|
69 |
+ Args:
|
|
70 |
+ instance_name (str): The name of the instance to query.
|
|
71 |
+ |
|
72 |
+ Returns:
|
|
73 |
+ Scheduler: A reference to the scheduler for `instance_name`.
|
|
74 |
+ |
|
75 |
+ Raises:
|
|
76 |
+ InvalidArgumentError: If no instance named `instance_name` exists.
|
|
77 |
+ """
|
|
78 |
+ instance = self._get_instance(instance_name)
|
|
79 |
+ |
|
80 |
+ return instance.scheduler
|
|
81 |
+ |
|
82 |
+ # --- Public API: Servicer ---
|
|
44 | 83 |
|
45 | 84 |
def Execute(self, request, context):
|
85 |
+ """Handles ExecuteRequest messages.
|
|
86 |
+ |
|
87 |
+ Args:
|
|
88 |
+ request (ExecuteRequest): The incoming RPC request.
|
|
89 |
+ context (grpc.ServicerContext): Context for the RPC call.
|
|
90 |
+ """
|
|
46 | 91 |
self.__logger.debug("Execute request from [%s]", context.peer())
|
47 | 92 |
|
93 |
+ instance_name = request.instance_name
|
|
94 |
+ message_queue = queue.Queue()
|
|
95 |
+ peer = context.peer()
|
|
96 |
+ |
|
48 | 97 |
try:
|
49 |
- message_queue = queue.Queue()
|
|
50 |
- instance = self._get_instance(request.instance_name)
|
|
98 |
+ instance = self._get_instance(instance_name)
|
|
51 | 99 |
operation = instance.execute(request.action_digest,
|
52 | 100 |
request.skip_cache_lookup,
|
53 | 101 |
message_queue)
|
54 | 102 |
|
55 |
- context.add_callback(partial(instance.unregister_message_client,
|
|
56 |
- operation.name, message_queue))
|
|
103 |
+ context.add_callback(partial(self._rpc_termination_callback,
|
|
104 |
+ peer, instance_name, operation.name, message_queue))
|
|
57 | 105 |
|
58 |
- instanced_op_name = "{}/{}".format(request.instance_name,
|
|
59 |
- operation.name)
|
|
106 |
+ if self._is_instrumented:
|
|
107 |
+ if peer not in self.__peers:
|
|
108 |
+ self.__peers_by_instance[instance_name].add(peer)
|
|
109 |
+ self.__peers[peer] = 1
|
|
110 |
+ else:
|
|
111 |
+ self.__peers[peer] += 1
|
|
112 |
+ |
|
113 |
+ instanced_op_name = "{}/{}".format(instance_name, operation.name)
|
|
60 | 114 |
|
61 | 115 |
self.__logger.info("Operation name: [%s]", instanced_op_name)
|
62 | 116 |
|
@@ -86,23 +140,33 @@ class ExecutionService(remote_execution_pb2_grpc.ExecutionServicer):
             yield operations_pb2.Operation()
 
     def WaitExecution(self, request, context):
-        self.__logger.debug("WaitExecution request from [%s]", context.peer())
+        """Handles WaitExecutionRequest messages.
 
-        try:
-            names = request.name.split("/")
+        Args:
+            request (WaitExecutionRequest): The incoming RPC request.
+            context (grpc.ServicerContext): Context for the RPC call.
+        """
+        self.__logger.debug("WaitExecution request from [%s]", context.peer())
 
-            # Operation name should be in format:
-            # {instance/name}/{operation_id}
-            instance_name = ''.join(names[0:-1])
+        names = request.name.split('/')
+        instance_name = '/'.join(names[:-1])
+        operation_name = names[-1]
+        message_queue = queue.Queue()
+        peer = context.peer()
 
-            message_queue = queue.Queue()
-            operation_name = names[-1]
+        try:
             instance = self._get_instance(instance_name)
 
             instance.register_message_client(operation_name, message_queue)
+            context.add_callback(partial(self._rpc_termination_callback,
+                                         peer, instance_name, operation_name, message_queue))
 
-            context.add_callback(partial(instance.unregister_message_client,
-                                         operation_name, message_queue))
+            if self._is_instrumented:
+                if peer not in self.__peers:
+                    self.__peers_by_instance[instance_name].add(peer)
+                    self.__peers[peer] = 1
+                else:
+                    self.__peers[peer] += 1
 
             for operation in instance.stream_operation_updates(message_queue,
                                                                operation_name):
@@ -123,6 +187,39 @@ class ExecutionService(remote_execution_pb2_grpc.ExecutionServicer):
             context.set_code(grpc.StatusCode.CANCELLED)
             yield operations_pb2.Operation()
 
+    # --- Public API: Monitoring ---
+
+    @property
+    def is_instrumented(self):
+        return self._is_instrumented
+
+    def query_n_clients(self):
+        if self.__peers is not None:
+            return len(self.__peers)
+        return 0
+
+    def query_n_clients_for_instance(self, instance_name):
+        try:
+            if self.__peers_by_instance is not None:
+                return len(self.__peers_by_instance[instance_name])
+        except KeyError:
+            pass
+        return 0
+
+    # --- Private API ---
+
+    def _rpc_termination_callback(self, peer, instance_name, job_name, message_queue):
+        instance = self._get_instance(instance_name)
+
+        instance.unregister_message_client(job_name, message_queue)
+
+        if self._is_instrumented:
+            if self.__peers[peer] > 1:
+                self.__peers[peer] -= 1
+            else:
+                self.__peers_by_instance[instance_name].remove(peer)
+                del self.__peers[peer]
+
     def _get_instance(self, name):
         try:
             return self._instances[name]
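The per-peer bookkeeping above amounts to a reference count: a peer with several concurrent Execute or WaitExecution streams is counted once per stream and only dropped from both maps when its last stream terminates. The same logic reduced to a standalone sketch (names are illustrative):

    peers = {}

    def stream_opened(peer):
        peers[peer] = peers.get(peer, 0) + 1

    def stream_closed(peer):
        if peers[peer] > 1:
            peers[peer] -= 1
        else:
            del peers[peer]  # last active stream for this peer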
@@ -15,19 +15,29 @@
 
 import asyncio
 from concurrent import futures
+from datetime import datetime, timedelta
 import logging
+import logging.handlers
 import os
 import signal
+import sys
+import time
 
 import grpc
+import janus
 
+from buildgrid._enums import BotStatus, LogRecordLevel, MetricRecordDomain, MetricRecordType
+from buildgrid._protos.buildgrid.v2 import monitoring_pb2
 from buildgrid.server.actioncache.service import ActionCacheService
 from buildgrid.server.bots.service import BotsService
+from buildgrid.server.capabilities.instance import CapabilitiesInstance
+from buildgrid.server.capabilities.service import CapabilitiesService
 from buildgrid.server.cas.service import ByteStreamService, ContentAddressableStorageService
 from buildgrid.server.execution.service import ExecutionService
 from buildgrid.server._monitoring import MonitoringBus, MonitoringOutputType, MonitoringOutputFormat
 from buildgrid.server.operations.service import OperationsService
 from buildgrid.server.referencestorage.service import ReferenceStorageService
+from buildgrid.settings import LOG_RECORD_FORMAT, MONITORING_PERIOD
 
 
 class BuildGridServer:
@@ -53,8 +63,23 @@ class BuildGridServer:
         self.__grpc_server = grpc.server(self.__grpc_executor)
 
         self.__main_loop = asyncio.get_event_loop()
+
         self.__monitoring_bus = None
 
+        self.__logging_queue = janus.Queue(loop=self.__main_loop)
+        self.__logging_handler = logging.handlers.QueueHandler(self.__logging_queue.sync_q)
+        self.__logging_formatter = logging.Formatter(fmt=LOG_RECORD_FORMAT)
+        self.__print_log_records = True
+
+        self.__build_metadata_queues = None
+
+        self.__state_monitoring_task = None
+        self.__build_monitoring_tasks = None
+        self.__logging_task = None
+
+        # We always want a capabilities service:
+        self._capabilities_service = CapabilitiesService(self.__grpc_server)
+
         self._execution_service = None
         self._bots_service = None
         self._operations_service = None
@@ -63,6 +88,9 @@ class BuildGridServer:
         self._cas_service = None
         self._bytestream_service = None
 
+        self._schedulers = {}
+        self._instances = set()
+
         self._is_instrumented = monitor
 
         if self._is_instrumented:
70 | 98 |
self.__main_loop, endpoint_type=MonitoringOutputType.STDOUT,
|
71 | 99 |
serialisation_format=MonitoringOutputFormat.JSON)
|
72 | 100 |
|
101 |
+ self.__build_monitoring_tasks = []
|
|
102 |
+ |
|
103 |
+ # Setup the main logging handler:
|
|
104 |
+ root_logger = logging.getLogger()
|
|
105 |
+ |
|
106 |
+ for log_filter in root_logger.filters[:]:
|
|
107 |
+ self.__logging_handler.addFilter(log_filter)
|
|
108 |
+ root_logger.removeFilter(log_filter)
|
|
109 |
+ |
|
110 |
+ for log_handler in root_logger.handlers[:]:
|
|
111 |
+ root_logger.removeHandler(log_handler)
|
|
112 |
+ root_logger.addHandler(self.__logging_handler)
|
|
113 |
+ |
|
73 | 114 |
# --- Public API ---
|
74 | 115 |
|
75 | 116 |
def start(self):
|
@@ -79,6 +120,25 @@ class BuildGridServer:
         if self._is_instrumented:
             self.__monitoring_bus.start()
 
+            self.__state_monitoring_task = asyncio.ensure_future(
+                self._state_monitoring_worker(period=MONITORING_PERIOD),
+                loop=self.__main_loop)
+
+            self.__build_monitoring_tasks.clear()
+            for instance_name, scheduler in self._schedulers.items():
+                if not scheduler.is_instrumented:
+                    continue
+
+                message_queue = janus.Queue(loop=self.__main_loop)
+                scheduler.register_build_metadata_watcher(message_queue.sync_q)
+
+                self.__build_monitoring_tasks.append(asyncio.ensure_future(
+                    self._build_monitoring_worker(instance_name, message_queue),
+                    loop=self.__main_loop))
+
+        self.__logging_task = asyncio.ensure_future(
+            self._logging_worker(), loop=self.__main_loop)
+
         self.__main_loop.add_signal_handler(signal.SIGTERM, self.stop)
 
         self.__main_loop.run_forever()
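The janus queues used here expose the same queue twice: sync_q to threaded producers (the schedulers) and async_q to the asyncio consumers (the monitoring workers). A standalone sketch of the pattern, independent of BuildGrid:

    import asyncio

    import janus

    async def consume(queue):
        # Awaits items pushed from the synchronous side:
        print(await queue.async_q.get())

    loop = asyncio.get_event_loop()
    queue = janus.Queue(loop=loop)

    queue.sync_q.put('build-metadata')  # callable from any thread
    loop.run_until_complete(consume(queue))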
@@ -86,8 +146,18 @@ class BuildGridServer:
     def stop(self):
         """Stops the BuildGrid server."""
         if self._is_instrumented:
+            if self.__state_monitoring_task is not None:
+                self.__state_monitoring_task.cancel()
+
+            for build_monitoring_task in self.__build_monitoring_tasks:
+                build_monitoring_task.cancel()
+            self.__build_monitoring_tasks.clear()
+
             self.__monitoring_bus.stop()
 
+        if self.__logging_task is not None:
+            self.__logging_task.cancel()
+
         self.__main_loop.stop()
 
         self.__grpc_server.stop(None)
@@ -125,9 +195,14 @@ class BuildGridServer:
             instance_name (str): Instance name.
         """
         if self._execution_service is None:
-            self._execution_service = ExecutionService(self.__grpc_server)
+            self._execution_service = ExecutionService(
+                self.__grpc_server, monitor=self._is_instrumented)
 
         self._execution_service.add_instance(instance_name, instance)
+        self._add_capabilities_instance(instance_name, execution_instance=instance)
+
+        self._schedulers[instance_name] = instance.scheduler
+        self._instances.add(instance_name)
 
     def add_bots_interface(self, instance, instance_name):
         """Adds a :obj:`BotsInterface` to the service.
@@ -139,10 +214,13 @@ class BuildGridServer:
             instance_name (str): Instance name.
         """
         if self._bots_service is None:
-            self._bots_service = BotsService(self.__grpc_server)
+            self._bots_service = BotsService(
+                self.__grpc_server, monitor=self._is_instrumented)
 
         self._bots_service.add_instance(instance_name, instance)
 
+        self._instances.add(instance_name)
+
     def add_operations_instance(self, instance, instance_name):
         """Adds an :obj:`OperationsInstance` to the service.
 
184 | 262 |
self._action_cache_service = ActionCacheService(self.__grpc_server)
|
185 | 263 |
|
186 | 264 |
self._action_cache_service.add_instance(instance_name, instance)
|
265 |
+ self._add_capabilities_instance(instance_name, action_cache_instance=instance)
|
|
187 | 266 |
|
188 | 267 |
def add_cas_instance(self, instance, instance_name):
|
189 |
- """Stores a :obj:`ContentAddressableStorageInstance` to the service.
|
|
268 |
+ """Adds a :obj:`ContentAddressableStorageInstance` to the service.
|
|
190 | 269 |
|
191 | 270 |
If no service exists, it creates one.
|
192 | 271 |
|
... | ... | @@ -198,9 +277,10 @@ class BuildGridServer: |
198 | 277 |
self._cas_service = ContentAddressableStorageService(self.__grpc_server)
|
199 | 278 |
|
200 | 279 |
self._cas_service.add_instance(instance_name, instance)
|
280 |
+ self._add_capabilities_instance(instance_name, cas_instance=instance)
|
|
201 | 281 |
|
202 | 282 |
def add_bytestream_instance(self, instance, instance_name):
|
203 |
- """Stores a :obj:`ByteStreamInstance` to the service.
|
|
283 |
+ """Adds a :obj:`ByteStreamInstance` to the service.
|
|
204 | 284 |
|
205 | 285 |
If no service exists, it creates one.
|
206 | 286 |
|
... | ... | @@ -218,3 +298,279 @@ class BuildGridServer: |
218 | 298 |
@property
|
219 | 299 |
def is_instrumented(self):
|
220 | 300 |
return self._is_instrumented
|
301 |
+ |
|
302 |
+ # --- Private API ---
|
|
303 |
+ |
|
304 |
+ def _add_capabilities_instance(self, instance_name,
|
|
305 |
+ cas_instance=None,
|
|
306 |
+ action_cache_instance=None,
|
|
307 |
+ execution_instance=None):
|
|
308 |
+ """Adds a :obj:`CapabilitiesInstance` to the service.
|
|
309 |
+ |
|
310 |
+ Args:
|
|
311 |
+ instance (:obj:`CapabilitiesInstance`): Instance to add.
|
|
312 |
+ instance_name (str): Instance name.
|
|
313 |
+ """
|
|
314 |
+ |
|
315 |
+ try:
|
|
316 |
+ if cas_instance:
|
|
317 |
+ self._capabilities_service.add_cas_instance(instance_name, cas_instance)
|
|
318 |
+ if action_cache_instance:
|
|
319 |
+ self._capabilities_service.add_action_cache_instance(instance_name, action_cache_instance)
|
|
320 |
+ if execution_instance:
|
|
321 |
+ self._capabilities_service.add_execution_instance(instance_name, execution_instance)
|
|
322 |
+ |
|
323 |
+ except KeyError:
|
|
324 |
+ capabilities_instance = CapabilitiesInstance(cas_instance,
|
|
325 |
+ action_cache_instance,
|
|
326 |
+ execution_instance)
|
|
327 |
+ self._capabilities_service.add_instance(instance_name, capabilities_instance)
|
|
328 |
+ |
|
329 |
+ async def _logging_worker(self):
|
|
330 |
+ """Publishes log records to the monitoring bus."""
|
|
331 |
+ async def __logging_worker():
|
|
332 |
+ log_record = await self.__logging_queue.async_q.get()
|
|
333 |
+ |
|
334 |
+ # Print log records to stdout, if required:
|
|
335 |
+ if self.__print_log_records:
|
|
336 |
+ record = self.__logging_formatter.format(log_record)
|
|
337 |
+ |
|
338 |
+ # TODO: Investigate if async write would be worth here.
|
|
339 |
+ sys.stdout.write('{}\n'.format(record))
|
|
340 |
+ sys.stdout.flush()
|
|
341 |
+ |
|
342 |
+ # Emit a log record if server is instrumented:
|
|
343 |
+ if self._is_instrumented:
|
|
344 |
+ log_record_level = LogRecordLevel(int(log_record.levelno / 10))
|
|
345 |
+ log_record_creation_time = datetime.fromtimestamp(log_record.created)
|
|
346 |
+ # logging.LogRecord.extra must be a str to str dict:
|
|
347 |
+ if 'extra' in log_record.__dict__ and log_record.extra:
|
|
348 |
+ log_record_metadata = log_record.extra
|
|
349 |
+ else:
|
|
350 |
+ log_record_metadata = None
|
|
351 |
+ record = self._forge_log_record(
|
|
352 |
+ log_record.name, log_record_level, log_record.message,
|
|
353 |
+ log_record_creation_time, metadata=log_record_metadata)
|
|
354 |
+ |
|
355 |
+ await self.__monitoring_bus.send_record(record)
|
|
356 |
+ |
|
357 |
+ try:
|
|
358 |
+ while True:
|
|
359 |
+ await __logging_worker()
|
|
360 |
+ |
|
361 |
+ except asyncio.CancelledError:
|
|
362 |
+ pass
|
|
363 |
+ |
|
364 |
+ def _forge_log_record(self, domain, level, message, creation_time, metadata=None):
|
|
365 |
+ log_record = monitoring_pb2.LogRecord()
|
|
366 |
+ |
|
367 |
+ log_record.creation_timestamp.FromDatetime(creation_time)
|
|
368 |
+ log_record.domain = domain
|
|
369 |
+ log_record.level = level.value
|
|
370 |
+ log_record.message = message
|
|
371 |
+ if metadata is not None:
|
|
372 |
+ log_record.metadata.update(metadata)
|
|
373 |
+ |
|
374 |
+ return log_record
|
|
375 |
+ |
|
376 |
+ async def _build_monitoring_worker(self, instance_name, message_queue):
|
|
377 |
+ """Publishes builds metadata to the monitoring bus."""
|
|
378 |
+ async def __build_monitoring_worker():
|
|
379 |
+ metadata = await message_queue.async_q.get()
|
|
380 |
+ |
|
381 |
+ # Emit build inputs fetching time record:
|
|
382 |
+ fetch_start = metadata.input_fetch_start_timestamp.ToDatetime()
|
|
383 |
+ fetch_completed = metadata.input_fetch_completed_timestamp.ToDatetime()
|
|
384 |
+ input_fetch_time = fetch_completed - fetch_start
|
|
385 |
+ timer_record = self._forge_timer_metric_record(
|
|
386 |
+ MetricRecordDomain.BUILD, 'inputs-fetching-time', input_fetch_time,
|
|
387 |
+ metadata={'instance-name': instance_name or 'void'})
|
|
388 |
+ |
|
389 |
+ await self.__monitoring_bus.send_record(timer_record)
|
|
390 |
+ |
|
391 |
+ # Emit build execution time record:
|
|
392 |
+ execution_start = metadata.execution_start_timestamp.ToDatetime()
|
|
393 |
+ execution_completed = metadata.execution_completed_timestamp.ToDatetime()
|
|
394 |
+ execution_time = execution_completed - execution_start
|
|
395 |
+ timer_record = self._forge_timer_metric_record(
|
|
396 |
+ MetricRecordDomain.BUILD, 'execution-time', execution_time,
|
|
397 |
+ metadata={'instance-name': instance_name or 'void'})
|
|
398 |
+ |
|
399 |
+ await self.__monitoring_bus.send_record(timer_record)
|
|
400 |
+ |
|
401 |
+ # Emit build outputs uploading time record:
|
|
402 |
+ upload_start = metadata.output_upload_start_timestamp.ToDatetime()
|
|
403 |
+ upload_completed = metadata.output_upload_completed_timestamp.ToDatetime()
|
|
404 |
+ output_upload_time = upload_completed - upload_start
|
|
405 |
+ timer_record = self._forge_timer_metric_record(
|
|
406 |
+ MetricRecordDomain.BUILD, 'outputs-uploading-time', output_upload_time,
|
|
407 |
+ metadata={'instance-name': instance_name or 'void'})
|
|
408 |
+ |
|
409 |
+ await self.__monitoring_bus.send_record(timer_record)
|
|
410 |
+ |
|
411 |
+ # Emit total build handling time record:
|
|
412 |
+ queued = metadata.queued_timestamp.ToDatetime()
|
|
413 |
+ worker_completed = metadata.worker_completed_timestamp.ToDatetime()
|
|
414 |
+ total_handling_time = worker_completed - queued
|
|
415 |
+ timer_record = self._forge_timer_metric_record(
|
|
416 |
+ MetricRecordDomain.BUILD, 'total-handling-time', total_handling_time,
|
|
417 |
+ metadata={'instance-name': instance_name or 'void'})
|
|
418 |
+ |
|
419 |
+ await self.__monitoring_bus.send_record(timer_record)
|
|
420 |
+ |
|
421 |
+ try:
|
|
422 |
+ while True:
|
|
423 |
+ await __build_monitoring_worker()
|
|
424 |
+ |
|
425 |
+ except asyncio.CancelledError:
|
|
426 |
+ pass
|
|
427 |
+ |
|
428 |
+ async def _state_monitoring_worker(self, period=1.0):
|
|
429 |
+ """Periodically publishes state metrics to the monitoring bus."""
|
|
430 |
+ async def __state_monitoring_worker():
|
|
431 |
+ # Emit total clients count record:
|
|
432 |
+ _, record = self._query_n_clients()
|
|
433 |
+ await self.__monitoring_bus.send_record(record)
|
|
434 |
+ |
|
435 |
+ # Emit total bots count record:
|
|
436 |
+ _, record = self._query_n_bots()
|
|
437 |
+ await self.__monitoring_bus.send_record(record)
|
|
438 |
+ |
|
439 |
+ queue_times = []
|
|
440 |
+ # Emit records for each instance:
|
|
441 |
+ for instance_name in self._instances:
|
|
442 |
+ # Emit instance clients count record:
|
|
443 |
+ _, record = self._query_n_clients_for_instance(instance_name)
|
|
444 |
+ await self.__monitoring_bus.send_record(record)
|
|
445 |
+ |
|
446 |
+ # Emit instance bots count record:
|
|
447 |
+ _, record = self._query_n_bots_for_instance(instance_name)
|
|
448 |
+ await self.__monitoring_bus.send_record(record)
|
|
449 |
+ |
|
450 |
+ # Emit instance average queue time record:
|
|
451 |
+ queue_time, record = self._query_am_queue_time_for_instance(instance_name)
|
|
452 |
+ await self.__monitoring_bus.send_record(record)
|
|
453 |
+ if queue_time:
|
|
454 |
+ queue_times.append(queue_time)
|
|
455 |
+ |
|
456 |
+ # Emit records for each bot status:
|
|
457 |
+ for bot_status in [BotStatus.OK, BotStatus.UNHEALTHY]:
|
|
458 |
+ # Emit status bots count record:
|
|
459 |
+ _, record = self._query_n_bots_for_status(bot_status)
|
|
460 |
+ await self.__monitoring_bus.send_record(record)
|
|
461 |
+ |
|
462 |
+ # Emit overall average queue time record:
|
|
463 |
+ if queue_times:
|
|
464 |
+ am_queue_time = sum(queue_times, timedelta()) / len(queue_times)
|
|
465 |
+ else:
|
|
466 |
+ am_queue_time = timedelta()
|
|
467 |
+ record = self._forge_timer_metric_record(
|
|
468 |
+ MetricRecordDomain.STATE,
|
|
469 |
+ 'average-queue-time',
|
|
470 |
+ am_queue_time)
|
|
471 |
+ |
|
472 |
+ await self.__monitoring_bus.send_record(record)
|
|
473 |
+ |
|
474 |
+ try:
|
|
475 |
+ while True:
|
|
476 |
+ start = time.time()
|
|
477 |
+ await __state_monitoring_worker()
|
|
478 |
+ |
|
479 |
+ end = time.time()
|
|
480 |
+ await asyncio.sleep(max(0.0, period - (end - start)))
|
|
481 |
+ |
|
482 |
+ except asyncio.CancelledError:
|
|
483 |
+ pass
|
|
484 |
+ |
|
485 |
+ def _forge_counter_metric_record(self, domain, name, count, metadata=None):
|
|
486 |
+ counter_record = monitoring_pb2.MetricRecord()
|
|
487 |
+ |
|
488 |
+ counter_record.creation_timestamp.GetCurrentTime()
|
|
489 |
+ counter_record.domain = domain.value
|
|
490 |
+ counter_record.type = MetricRecordType.COUNTER.value
|
|
491 |
+ counter_record.name = name
|
|
492 |
+ counter_record.count = count
|
|
493 |
+ if metadata is not None:
|
|
494 |
+ counter_record.metadata.update(metadata)
|
|
495 |
+ |
|
496 |
+ return counter_record
|
|
497 |
+ |
|
498 |
+ def _forge_timer_metric_record(self, domain, name, duration, metadata=None):
|
|
499 |
+ timer_record = monitoring_pb2.MetricRecord()
|
|
500 |
+ |
|
501 |
+ timer_record.creation_timestamp.GetCurrentTime()
|
|
502 |
+ timer_record.domain = domain.value
|
|
503 |
+ timer_record.type = MetricRecordType.TIMER.value
|
|
504 |
+ timer_record.name = name
|
|
505 |
+ timer_record.duration.FromTimedelta(duration)
|
|
506 |
+ if metadata is not None:
|
|
507 |
+ timer_record.metadata.update(metadata)
|
|
508 |
+ |
|
509 |
+ return timer_record
|
|
510 |
+ |
|
511 |
+ def _forge_gauge_metric_record(self, domain, name, value, metadata=None):
|
|
512 |
+ gauge_record = monitoring_pb2.MetricRecord()
|
|
513 |
+ |
|
514 |
+ gauge_record.creation_timestamp.GetCurrentTime()
|
|
515 |
+ gauge_record.domain = domain.value
|
|
516 |
+ gauge_record.type = MetricRecordType.GAUGE.value
|
|
517 |
+ gauge_record.name = name
|
|
518 |
+ gauge_record.value = value
|
|
519 |
+ if metadata is not None:
|
|
520 |
+ gauge_record.metadata.update(metadata)
|
|
521 |
+ |
|
522 |
+ return gauge_record
|
|
523 |
+ |
|
524 |
+ # --- Private API: Monitoring ---
|
|
525 |
+ |
|
526 |
+ def _query_n_clients(self):
|
|
527 |
+ """Queries the number of clients connected."""
|
|
528 |
+ n_clients = self._execution_service.query_n_clients()
|
|
529 |
+ gauge_record = self._forge_gauge_metric_record(
|
|
530 |
+ MetricRecordDomain.STATE, 'clients-count', n_clients)
|
|
531 |
+ |
|
532 |
+ return n_clients, gauge_record
|
|
533 |
+ |
|
534 |
+ def _query_n_clients_for_instance(self, instance_name):
|
|
535 |
+ """Queries the number of clients connected for a given instance"""
|
|
536 |
+ n_clients = self._execution_service.query_n_clients_for_instance(instance_name)
|
|
537 |
+ gauge_record = self._forge_gauge_metric_record(
|
|
538 |
+ MetricRecordDomain.STATE, 'clients-count', n_clients,
|
|
539 |
+ metadata={'instance-name': instance_name or 'void'})
|
|
540 |
+ |
|
541 |
+ return n_clients, gauge_record
|
|
542 |
+ |
|
543 |
+ def _query_n_bots(self):
|
|
544 |
+ """Queries the number of bots connected."""
|
|
545 |
+ n_bots = self._bots_service.query_n_bots()
|
|
546 |
+ gauge_record = self._forge_gauge_metric_record(
|
|
547 |
+ MetricRecordDomain.STATE, 'bots-count', n_bots)
|
|
548 |
+ |
|
549 |
+ return n_bots, gauge_record
|
|
550 |
+ |
|
551 |
+ def _query_n_bots_for_instance(self, instance_name):
|
|
552 |
+ """Queries the number of bots connected for a given instance."""
|
|
553 |
+ n_bots = self._bots_service.query_n_bots_for_instance(instance_name)
|
|
554 |
+ gauge_record = self._forge_gauge_metric_record(
|
|
555 |
+ MetricRecordDomain.STATE, 'bots-count', n_bots,
|
|
556 |
+ metadata={'instance-name': instance_name or 'void'})
|
|
557 |
+ |
|
558 |
+ return n_bots, gauge_record
|
|
559 |
+ |
|
560 |
+ def _query_n_bots_for_status(self, bot_status):
|
|
561 |
+ """Queries the number of bots connected for a given health status."""
|
|
562 |
+ n_bots = self._bots_service.query_n_bots_for_status(bot_status)
|
|
563 |
+ gauge_record = self._forge_gauge_metric_record(
|
|
564 |
+ MetricRecordDomain.STATE, 'bots-count', n_bots,
|
|
565 |
+ metadata={'bot-status': bot_status.name})
|
|
566 |
+ |
|
567 |
+ return n_bots, gauge_record
|
|
568 |
+ |
|
569 |
+ def _query_am_queue_time_for_instance(self, instance_name):
|
|
570 |
+ """Queries the average job's queue time for a given instance."""
|
|
571 |
+ am_queue_time = self._schedulers[instance_name].query_am_queue_time()
|
|
572 |
+ timer_record = self._forge_timer_metric_record(
|
|
573 |
+ MetricRecordDomain.STATE, 'average-queue-time', am_queue_time,
|
|
574 |
+ metadata={'instance-name': instance_name or 'void'})
|
|
575 |
+ |
|
576 |
+ return am_queue_time, timer_record
|
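Note on the metric helpers above: each `_forge_*_metric_record()` method fills a `monitoring_pb2.MetricRecord` with a creation timestamp, a domain, a record type and optional string metadata, so subscribers can aggregate along either axis. A minimal standalone sketch of the message produced for a 2.5-second BUILD timer; the import paths are assumptions inferred from the modules this diff touches:

    from datetime import timedelta

    from buildgrid._enums import MetricRecordDomain, MetricRecordType
    from buildgrid._protos.buildgrid.v2 import monitoring_pb2

    record = monitoring_pb2.MetricRecord()
    record.creation_timestamp.GetCurrentTime()      # protobuf Timestamp helper
    record.domain = MetricRecordDomain.BUILD.value
    record.type = MetricRecordType.TIMER.value
    record.name = 'execution-time'
    record.duration.FromTimedelta(timedelta(seconds=2.5))
    record.metadata.update({'instance-name': 'void'})  # str-to-str map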
... | ... | @@ -13,10 +13,11 @@ |
13 | 13 |
# limitations under the License.
|
14 | 14 |
|
15 | 15 |
|
16 |
+from datetime import datetime
|
|
16 | 17 |
import logging
|
17 | 18 |
import uuid
|
18 | 19 |
|
19 |
-from google.protobuf import timestamp_pb2
|
|
20 |
+from google.protobuf import duration_pb2, timestamp_pb2
|
|
20 | 21 |
|
21 | 22 |
from buildgrid._enums import LeaseState, OperationStage
|
22 | 23 |
from buildgrid._exceptions import CancelledError
|
... | ... | @@ -40,6 +41,7 @@ class Job: |
40 | 41 |
self.__operation_metadata = remote_execution_pb2.ExecuteOperationMetadata()
|
41 | 42 |
|
42 | 43 |
self.__queued_timestamp = timestamp_pb2.Timestamp()
|
44 |
+ self.__queued_time_duration = duration_pb2.Duration()
|
|
43 | 45 |
self.__worker_start_timestamp = timestamp_pb2.Timestamp()
|
44 | 46 |
self.__worker_completed_timestamp = timestamp_pb2.Timestamp()
|
45 | 47 |
|
... | ... | @@ -56,6 +58,8 @@ class Job: |
56 | 58 |
self._operation.done = False
|
57 | 59 |
self._n_tries = 0
|
58 | 60 |
|
61 |
+ # --- Public API ---
|
|
62 |
+ |
|
59 | 63 |
@property
|
60 | 64 |
def name(self):
|
61 | 65 |
return self._name
|
... | ... | @@ -79,6 +83,13 @@ class Job: |
79 | 83 |
else:
|
80 | 84 |
return None
|
81 | 85 |
|
86 |
+ @property
|
|
87 |
+ def holds_cached_action_result(self):
|
|
88 |
+ if self.__execute_response is not None:
|
|
89 |
+ return self.__execute_response.cached_result
|
|
90 |
+ else:
|
|
91 |
+ return False
|
|
92 |
+ |
|
82 | 93 |
@property
|
83 | 94 |
def operation(self):
|
84 | 95 |
return self._operation
|
... | ... | @@ -193,7 +204,7 @@ class Job: |
193 | 204 |
result.Unpack(action_result)
|
194 | 205 |
|
195 | 206 |
action_metadata = action_result.execution_metadata
|
196 |
- action_metadata.queued_timestamp.CopyFrom(self.__worker_start_timestamp)
|
|
207 |
+ action_metadata.queued_timestamp.CopyFrom(self.__queued_timestamp)
|
|
197 | 208 |
action_metadata.worker_start_timestamp.CopyFrom(self.__worker_start_timestamp)
|
198 | 209 |
action_metadata.worker_completed_timestamp.CopyFrom(self.__worker_completed_timestamp)
|
199 | 210 |
|
... | ... | @@ -227,6 +238,10 @@ class Job: |
227 | 238 |
self.__queued_timestamp.GetCurrentTime()
|
228 | 239 |
self._n_tries += 1
|
229 | 240 |
|
241 |
+ elif self.__operation_metadata.stage == OperationStage.EXECUTING.value:
|
|
242 |
+ queue_in, queue_out = self.__queued_timestamp.ToDatetime(), datetime.utcnow()
|
|
243 |
+ self.__queued_time_duration.FromTimedelta(queue_out - queue_in)
|
|
244 |
+ |
|
230 | 245 |
elif self.__operation_metadata.stage == OperationStage.COMPLETED.value:
|
231 | 246 |
if self.__execute_response is not None:
|
232 | 247 |
self._operation.response.Pack(self.__execute_response)
|
... | ... | @@ -260,3 +275,11 @@ class Job: |
260 | 275 |
self.__execute_response.status.message = "Operation cancelled by client."
|
261 | 276 |
|
262 | 277 |
self.update_operation_stage(OperationStage.COMPLETED)
|
278 |
+ |
|
279 |
+ # --- Public API: Monitoring ---
|
|
280 |
+ |
|
281 |
+ def query_queue_time(self):
|
|
282 |
+ return self.__queued_time_duration.ToTimedelta()
|
|
283 |
+ |
|
284 |
+ def query_n_retries(self):
|
|
285 |
+ return self._n_tries - 1 if self._n_tries > 0 else 0
|
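The queue-time bookkeeping added above relies only on the protobuf well-known types; note that `Timestamp.ToDatetime()` returns a naive UTC datetime, which is why the EXECUTING branch compares against `datetime.utcnow()` rather than `datetime.now()`. A self-contained sketch of the round-trip:

    from datetime import datetime
    import time

    from google.protobuf import duration_pb2, timestamp_pb2

    queued_timestamp = timestamp_pb2.Timestamp()
    queued_time_duration = duration_pb2.Duration()

    queued_timestamp.GetCurrentTime()   # job reaches the QUEUED stage
    time.sleep(0.1)                     # ... job waits in the queue ...

    # Job reaches the EXECUTING stage: persist the elapsed time as a Duration.
    queue_in, queue_out = queued_timestamp.ToDatetime(), datetime.utcnow()
    queued_time_duration.FromTimedelta(queue_out - queue_in)

    print(queued_time_duration.ToTimedelta())   # roughly 0:00:00.100000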
... | ... | @@ -32,6 +32,10 @@ class OperationsInstance: |
32 | 32 |
|
33 | 33 |
self._scheduler = scheduler
|
34 | 34 |
|
35 |
+ @property
|
|
36 |
+ def scheduler(self):
|
|
37 |
+ return self._scheduler
|
|
38 |
+ |
|
35 | 39 |
def register_instance_with_server(self, instance_name, server):
|
36 | 40 |
server.add_operations_instance(self, instance_name)
|
37 | 41 |
|
... | ... | @@ -38,8 +38,18 @@ class OperationsService(operations_pb2_grpc.OperationsServicer): |
38 | 38 |
|
39 | 39 |
operations_pb2_grpc.add_OperationsServicer_to_server(self, server)
|
40 | 40 |
|
41 |
- def add_instance(self, name, instance):
|
|
42 |
- self._instances[name] = instance
|
|
41 |
+ # --- Public API ---
|
|
42 |
+ |
|
43 |
+ def add_instance(self, instance_name, instance):
|
|
44 |
+ """Registers a new servicer instance.
|
|
45 |
+ |
|
46 |
+ Args:
|
|
47 |
+ instance_name (str): The new instance's name.
|
|
48 |
+ instance (OperationsInstance): The new instance itself.
|
|
49 |
+ """
|
|
50 |
+ self._instances[instance_name] = instance
|
|
51 |
+ |
|
52 |
+ # --- Public API: Servicer ---
|
|
43 | 53 |
|
44 | 54 |
def GetOperation(self, request, context):
|
45 | 55 |
self.__logger.debug("GetOperation request from [%s]", context.peer())
|
... | ... | @@ -127,6 +137,8 @@ class OperationsService(operations_pb2_grpc.OperationsServicer): |
127 | 137 |
|
128 | 138 |
return Empty()
|
129 | 139 |
|
140 |
+ # --- Private API ---
|
|
141 |
+ |
|
130 | 142 |
def _parse_instance_name(self, name):
|
131 | 143 |
""" If the instance name is not blank, 'name' will have the form
|
132 | 144 |
{instance_name}/{operation_uuid}. Otherwise, it will just be
|
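For reference, wiring the servicer up with `add_instance()` might look like the following sketch; the constructor signatures are inferred from this diff, and the gRPC server and scheduler are stand-ins:

    from concurrent import futures

    import grpc

    from buildgrid.server.operations.instance import OperationsInstance
    from buildgrid.server.operations.service import OperationsService
    from buildgrid.server.scheduler import Scheduler

    grpc_server = grpc.server(futures.ThreadPoolExecutor(max_workers=1))

    # Register one named instance with the servicer:
    operations_service = OperationsService(grpc_server)
    operations_service.add_instance('main', OperationsInstance(Scheduler()))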
... | ... | @@ -20,33 +20,74 @@ Schedules jobs. |
20 | 20 |
"""
|
21 | 21 |
|
22 | 22 |
from collections import deque
|
23 |
+from datetime import timedelta
|
|
23 | 24 |
import logging
|
24 | 25 |
|
26 |
+from buildgrid._enums import LeaseState, OperationStage
|
|
25 | 27 |
from buildgrid._exceptions import NotFoundError
|
26 | 28 |
|
27 |
-from .job import OperationStage, LeaseState
|
|
28 |
- |
|
29 | 29 |
|
30 | 30 |
class Scheduler:
|
31 | 31 |
|
32 | 32 |
MAX_N_TRIES = 5
|
33 | 33 |
|
34 |
- def __init__(self, action_cache=None):
|
|
34 |
+ def __init__(self, action_cache=None, monitor=False):
|
|
35 | 35 |
self.__logger = logging.getLogger(__name__)
|
36 | 36 |
|
37 |
+ self.__build_metadata_queues = None
|
|
38 |
+ |
|
39 |
+ self.__operations_by_stage = None
|
|
40 |
+ self.__leases_by_state = None
|
|
41 |
+ self.__queue_time_average = None
|
|
42 |
+ self.__retries_count = 0
|
|
43 |
+ |
|
37 | 44 |
self._action_cache = action_cache
|
38 | 45 |
self.jobs = {}
|
39 | 46 |
self.queue = deque()
|
40 | 47 |
|
48 |
+ self._is_instrumented = monitor
|
|
49 |
+ |
|
50 |
+ if self._is_instrumented:
|
|
51 |
+ self.__build_metadata_queues = []
|
|
52 |
+ |
|
53 |
+ self.__operations_by_stage = {}
|
|
54 |
+ self.__leases_by_state = {}
|
|
55 |
+ self.__queue_time_average = 0, timedelta()
|
|
56 |
+ |
|
57 |
+ self.__operations_by_stage[OperationStage.CACHE_CHECK] = set()
|
|
58 |
+ self.__operations_by_stage[OperationStage.QUEUED] = set()
|
|
59 |
+ self.__operations_by_stage[OperationStage.EXECUTING] = set()
|
|
60 |
+ self.__operations_by_stage[OperationStage.COMPLETED] = set()
|
|
61 |
+ |
|
62 |
+ self.__leases_by_state[LeaseState.PENDING] = set()
|
|
63 |
+ self.__leases_by_state[LeaseState.ACTIVE] = set()
|
|
64 |
+ self.__leases_by_state[LeaseState.COMPLETED] = set()
|
|
65 |
+ |
|
66 |
+ # --- Public API ---
|
|
67 |
+ |
|
41 | 68 |
def register_client(self, job_name, queue):
|
42 |
- self.jobs[job_name].register_client(queue)
|
|
69 |
+ job = self.jobs[job_name]
|
|
70 |
+ |
|
71 |
+ job.register_client(queue)
|
|
43 | 72 |
|
44 | 73 |
def unregister_client(self, job_name, queue):
|
45 |
- self.jobs[job_name].unregister_client(queue)
|
|
74 |
+ job = self.jobs[job_name]
|
|
75 |
+ |
|
76 |
+ job.unregister_client(queue)
|
|
46 | 77 |
|
47 |
- if not self.jobs[job_name].n_clients and self.jobs[job_name].operation.done:
|
|
78 |
+ if not job.n_clients and job.operation.done:
|
|
48 | 79 |
del self.jobs[job_name]
|
49 | 80 |
|
81 |
+ if self._is_instrumented:
|
|
82 |
+ self.__operations_by_stage[OperationStage.CACHE_CHECK].discard(job_name)
|
|
83 |
+ self.__operations_by_stage[OperationStage.QUEUED].discard(job_name)
|
|
84 |
+ self.__operations_by_stage[OperationStage.EXECUTING].discard(job_name)
|
|
85 |
+ self.__operations_by_stage[OperationStage.COMPLETED].discard(job_name)
|
|
86 |
+ |
|
87 |
+ self.__leases_by_state[LeaseState.PENDING].discard(job_name)
|
|
88 |
+ self.__leases_by_state[LeaseState.ACTIVE].discard(job_name)
|
|
89 |
+ self.__leases_by_state[LeaseState.COMPLETED].discard(job_name)
|
|
90 |
+ |
|
50 | 91 |
def queue_job(self, job, skip_cache_lookup=False):
|
51 | 92 |
self.jobs[job.name] = job
|
52 | 93 |
|
... | ... | @@ -62,23 +103,30 @@ class Scheduler: |
62 | 103 |
job.set_cached_result(action_result)
|
63 | 104 |
operation_stage = OperationStage.COMPLETED
|
64 | 105 |
|
106 |
+ if self._is_instrumented:
|
|
107 |
+ self.__retries_count += 1
|
|
108 |
+ |
|
65 | 109 |
else:
|
66 | 110 |
operation_stage = OperationStage.QUEUED
|
67 | 111 |
self.queue.append(job)
|
68 | 112 |
|
69 |
- job.update_operation_stage(operation_stage)
|
|
113 |
+ self._update_job_operation_stage(job.name, operation_stage)
|
|
70 | 114 |
|
71 | 115 |
def retry_job(self, job_name):
|
72 |
- if job_name in self.jobs:
|
|
73 |
- job = self.jobs[job_name]
|
|
74 |
- if job.n_tries >= self.MAX_N_TRIES:
|
|
75 |
- # TODO: Decide what to do with these jobs
|
|
76 |
- job.update_operation_stage(OperationStage.COMPLETED)
|
|
77 |
- # TODO: Mark these jobs as done
|
|
78 |
- else:
|
|
79 |
- job.update_operation_stage(OperationStage.QUEUED)
|
|
80 |
- job.update_lease_state(LeaseState.PENDING)
|
|
81 |
- self.queue.append(job)
|
|
116 |
+ job = self.jobs[job_name]
|
|
117 |
+ |
|
118 |
+ operation_stage = None
|
|
119 |
+ if job.n_tries >= self.MAX_N_TRIES:
|
|
120 |
+ # TODO: Decide what to do with these jobs
|
|
121 |
+ operation_stage = OperationStage.COMPLETED
|
|
122 |
+ # TODO: Mark these jobs as done
|
|
123 |
+ |
|
124 |
+ else:
|
|
125 |
+ operation_stage = OperationStage.QUEUED
|
|
126 |
+ job.update_lease_state(LeaseState.PENDING)
|
|
127 |
+ self.queue.append(job)
|
|
128 |
+ |
|
129 |
+ self._update_job_operation_stage(job_name, operation_stage)
|
|
82 | 130 |
|
83 | 131 |
def list_jobs(self):
|
84 | 132 |
return self.jobs.values()
|
... | ... | @@ -118,17 +166,27 @@ class Scheduler: |
118 | 166 |
lease_result (google.protobuf.Any): the lease execution result, only
|
119 | 167 |
required if `lease_state` is `COMPLETED`.
|
120 | 168 |
"""
|
121 |
- |
|
122 | 169 |
job = self.jobs[lease.id]
|
123 | 170 |
lease_state = LeaseState(lease.state)
|
124 | 171 |
|
172 |
+ operation_stage = None
|
|
125 | 173 |
if lease_state == LeaseState.PENDING:
|
126 | 174 |
job.update_lease_state(LeaseState.PENDING)
|
127 |
- job.update_operation_stage(OperationStage.QUEUED)
|
|
175 |
+ operation_stage = OperationStage.QUEUED
|
|
176 |
+ |
|
177 |
+ if self._is_instrumented:
|
|
178 |
+ self.__leases_by_state[LeaseState.PENDING].add(lease.id)
|
|
179 |
+ self.__leases_by_state[LeaseState.ACTIVE].discard(lease.id)
|
|
180 |
+ self.__leases_by_state[LeaseState.COMPLETED].discard(lease.id)
|
|
128 | 181 |
|
129 | 182 |
elif lease_state == LeaseState.ACTIVE:
|
130 | 183 |
job.update_lease_state(LeaseState.ACTIVE)
|
131 |
- job.update_operation_stage(OperationStage.EXECUTING)
|
|
184 |
+ operation_stage = OperationStage.EXECUTING
|
|
185 |
+ |
|
186 |
+ if self._is_instrumented:
|
|
187 |
+ self.__leases_by_state[LeaseState.PENDING].discard(lease.id)
|
|
188 |
+ self.__leases_by_state[LeaseState.ACTIVE].add(lease.id)
|
|
189 |
+ self.__leases_by_state[LeaseState.COMPLETED].discard(lease.id)
|
|
132 | 190 |
|
133 | 191 |
elif lease_state == LeaseState.COMPLETED:
|
134 | 192 |
job.update_lease_state(LeaseState.COMPLETED,
|
... | ... | @@ -137,7 +195,14 @@ class Scheduler: |
137 | 195 |
if self._action_cache is not None and not job.do_not_cache:
|
138 | 196 |
self._action_cache.update_action_result(job.action_digest, job.action_result)
|
139 | 197 |
|
140 |
- job.update_operation_stage(OperationStage.COMPLETED)
|
|
198 |
+ operation_stage = OperationStage.COMPLETED
|
|
199 |
+ |
|
200 |
+ if self._is_instrumented:
|
|
201 |
+ self.__leases_by_state[LeaseState.PENDING].discard(lease.id)
|
|
202 |
+ self.__leases_by_state[LeaseState.ACTIVE].discard(lease.id)
|
|
203 |
+ self.__leases_by_state[LeaseState.COMPLETED].add(lease.id)
|
|
204 |
+ |
|
205 |
+ self._update_job_operation_stage(lease.id, operation_stage)
|
|
141 | 206 |
|
142 | 207 |
def get_job_lease(self, job_name):
|
143 | 208 |
"""Returns the lease associated to job, if any have been emitted yet."""
|
... | ... | @@ -160,3 +225,109 @@ class Scheduler: |
160 | 225 |
job_name (str): name of the job holding the operation to cancel.
|
161 | 226 |
"""
|
162 | 227 |
self.jobs[job_name].cancel_operation()
|
228 |
+ |
|
229 |
+ # --- Public API: Monitoring ---
|
|
230 |
+ |
|
231 |
+ @property
|
|
232 |
+ def is_instrumented(self):
|
|
233 |
+ return self._is_instrumented
|
|
234 |
+ |
|
235 |
+ def register_build_metadata_watcher(self, message_queue):
|
|
236 |
+ if self.__build_metadata_queues is not None:
|
|
237 |
+ self.__build_metadata_queues.append(message_queue)
|
|
238 |
+ |
|
239 |
+ def query_n_jobs(self):
|
|
240 |
+ return len(self.jobs)
|
|
241 |
+ |
|
242 |
+ def query_n_operations(self):
|
|
243 |
+ # For now n_operations == n_jobs:
|
|
244 |
+ return len(self.jobs)
|
|
245 |
+ |
|
246 |
+ def query_n_operations_by_stage(self, operation_stage):
|
|
247 |
+ try:
|
|
248 |
+ if self.__operations_by_stage is not None:
|
|
249 |
+ return len(self.__operations_by_stage[operation_stage])
|
|
250 |
+ except KeyError:
|
|
251 |
+ pass
|
|
252 |
+ return 0
|
|
253 |
+ |
|
254 |
+ def query_n_leases(self):
|
|
255 |
+ return len(self.jobs)
|
|
256 |
+ |
|
257 |
+ def query_n_leases_by_state(self, lease_state):
|
|
258 |
+ try:
|
|
259 |
+ if self.__leases_by_state is not None:
|
|
260 |
+ return len(self.__leases_by_state[lease_state])
|
|
261 |
+ except KeyError:
|
|
262 |
+ pass
|
|
263 |
+ return 0
|
|
264 |
+ |
|
265 |
+ def query_n_retries(self):
|
|
266 |
+ return self.__retries_count
|
|
267 |
+ |
|
268 |
+ def query_am_queue_time(self):
|
|
269 |
+ if self.__queue_time_average is not None:
|
|
270 |
+ return self.__queue_time_average[1]
|
|
271 |
+ return timedelta()
|
|
272 |
+ |
|
273 |
+ # --- Private API ---
|
|
274 |
+ |
|
275 |
+ def _update_job_operation_stage(self, job_name, operation_stage):
|
|
276 |
+ """Requests a stage transition for the job's :class:Operations.
|
|
277 |
+ |
|
278 |
+ Args:
|
|
279 |
+ job_name (str): name of the job to update.
|
|
280 |
+ operation_stage (OperationStage): the stage to transition to.
|
|
281 |
+ """
|
|
282 |
+ job = self.jobs[job_name]
|
|
283 |
+ |
|
284 |
+ if operation_stage == OperationStage.CACHE_CHECK:
|
|
285 |
+ job.update_operation_stage(OperationStage.CACHE_CHECK)
|
|
286 |
+ |
|
287 |
+ if self._is_instrumented:
|
|
288 |
+ self.__operations_by_stage[OperationStage.CACHE_CHECK].add(job_name)
|
|
289 |
+ self.__operations_by_stage[OperationStage.QUEUED].discard(job_name)
|
|
290 |
+ self.__operations_by_stage[OperationStage.EXECUTING].discard(job_name)
|
|
291 |
+ self.__operations_by_stage[OperationStage.COMPLETED].discard(job_name)
|
|
292 |
+ |
|
293 |
+ elif operation_stage == OperationStage.QUEUED:
|
|
294 |
+ job.update_operation_stage(OperationStage.QUEUED)
|
|
295 |
+ |
|
296 |
+ if self._is_instrumented:
|
|
297 |
+ self.__operations_by_stage[OperationStage.CACHE_CHECK].discard(job_name)
|
|
298 |
+ self.__operations_by_stage[OperationStage.QUEUED].add(job_name)
|
|
299 |
+ self.__operations_by_stage[OperationStage.EXECUTING].discard(job_name)
|
|
300 |
+ self.__operations_by_stage[OperationStage.COMPLETED].discard(job_name)
|
|
301 |
+ |
|
302 |
+ elif operation_stage == OperationStage.EXECUTING:
|
|
303 |
+ job.update_operation_stage(OperationStage.EXECUTING)
|
|
304 |
+ |
|
305 |
+ if self._is_instrumented:
|
|
306 |
+ self.__operations_by_stage[OperationStage.CACHE_CHECK].discard(job_name)
|
|
307 |
+ self.__operations_by_stage[OperationStage.QUEUED].discard(job_name)
|
|
308 |
+ self.__operations_by_stage[OperationStage.EXECUTING].add(job_name)
|
|
309 |
+ self.__operations_by_stage[OperationStage.COMPLETED].discard(job_name)
|
|
310 |
+ |
|
311 |
+ elif operation_stage == OperationStage.COMPLETED:
|
|
312 |
+ job.update_operation_stage(OperationStage.COMPLETED)
|
|
313 |
+ |
|
314 |
+ if self._is_instrumented:
|
|
315 |
+ self.__operations_by_stage[OperationStage.CACHE_CHECK].discard(job_name)
|
|
316 |
+ self.__operations_by_stage[OperationStage.QUEUED].discard(job_name)
|
|
317 |
+ self.__operations_by_stage[OperationStage.EXECUTING].discard(job_name)
|
|
318 |
+ self.__operations_by_stage[OperationStage.COMPLETED].add(job_name)
|
|
319 |
+ |
|
320 |
+ average_order, average_time = self.__queue_time_average
|
|
321 |
+ |
|
322 |
+ average_order += 1
|
|
323 |
+ if average_order <= 1:
|
|
324 |
+ average_time = job.query_queue_time()
|
|
325 |
+ else:
|
|
326 |
+ queue_time = job.query_queue_time()
|
|
327 |
+ average_time = average_time + ((queue_time - average_time) / average_order)
|
|
328 |
+ |
|
329 |
+ self.__queue_time_average = average_order, average_time
|
|
330 |
+ |
|
331 |
+ if not job.holds_cached_action_result:
|
|
332 |
+ for message_queue in self.__build_metadata_queues:
|
|
333 |
+ message_queue.put(job.action_result.execution_metadata)
|
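The `__queue_time_average` pair above keeps a running mean using the standard incremental update avg_n = avg_(n-1) + (x_n - avg_(n-1)) / n, so no per-job history is retained. A quick self-contained check (not part of the diff) that the recurrence matches the arithmetic mean:

    from datetime import timedelta

    samples = [timedelta(seconds=s) for s in (1, 3, 5, 7)]

    average_order, average_time = 0, timedelta()
    for queue_time in samples:
        average_order += 1
        if average_order <= 1:
            average_time = queue_time
        else:
            average_time = average_time + ((queue_time - average_time) / average_order)

    assert average_time == sum(samples, timedelta()) / len(samples)  # 0:00:04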
1 |
+# Copyright (C) 2018 Bloomberg LP
|
|
2 |
+#
|
|
3 |
+# Licensed under the Apache License, Version 2.0 (the "License");
|
|
4 |
+# you may not use this file except in compliance with the License.
|
|
5 |
+# You may obtain a copy of the License at
|
|
6 |
+#
|
|
7 |
+# <http://www.apache.org/licenses/LICENSE-2.0>
|
|
8 |
+#
|
|
9 |
+# Unless required by applicable law or agreed to in writing, software
|
|
10 |
+# distributed under the License is distributed on an "AS IS" BASIS,
|
|
11 |
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
12 |
+# See the License for the specific language governing permissions and
|
|
13 |
+# limitations under the License.
|
|
14 |
+ |
|
15 |
+ |
|
1 | 16 |
import hashlib
|
2 | 17 |
|
3 | 18 |
|
4 |
-# The hash function that CAS uses
|
|
19 |
+# Hash function used for computing digests:
|
|
5 | 20 |
HASH = hashlib.sha256
|
21 |
+ |
|
22 |
+# Length in characters of a hexadecimal hash string returned by HASH:
|
|
6 | 23 |
HASH_LENGTH = HASH().digest_size * 2
|
24 |
+ |
|
25 |
+# Period, in seconds, for the monitoring cycle:
|
|
26 |
+MONITORING_PERIOD = 5.0
|
|
27 |
+ |
|
28 |
+# Maximum size for a single gRPC request:
|
|
29 |
+MAX_REQUEST_SIZE = 2 * 1024 * 1024
|
|
30 |
+ |
|
31 |
+# Maximum number of elements per gRPC request:
|
|
32 |
+MAX_REQUEST_COUNT = 500
|
|
33 |
+ |
|
34 |
+# String format for log records:
|
|
35 |
+LOG_RECORD_FORMAT = '%(asctime)s:[%(name)36.36s][%(levelname)5.5s]: %(message)s'
|
|
36 |
+# The different log record attributes are documented here:
|
|
37 |
+# https://docs.python.org/3/library/logging.html#logrecord-attributes
|
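The relationship between the two hash settings can be checked in isolation; this assumes the `buildgrid.settings` module path used throughout this diff:

    from buildgrid.settings import HASH, HASH_LENGTH

    digest = HASH(b'example').hexdigest()
    assert len(digest) == HASH_LENGTH == 64  # sha256 hex digests are 64 characters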
... | ... | @@ -30,6 +30,14 @@ def get_hostname(): |
30 | 30 |
return socket.gethostname()
|
31 | 31 |
|
32 | 32 |
|
33 |
+def get_hash_type():
|
|
34 |
+ """Returns the hash type."""
|
|
35 |
+ hash_name = HASH().name
|
|
36 |
+ if hash_name == "sha256":
|
|
37 |
+ return remote_execution_pb2.SHA256
|
|
38 |
+ return remote_execution_pb2.UNKNOWN
|
|
39 |
+ |
|
40 |
+ |
|
33 | 41 |
def create_digest(bytes_to_digest):
|
34 | 42 |
"""Computes the :obj:`Digest` of a piece of data.
|
35 | 43 |
|
... | ... | @@ -112,13 +112,15 @@ setup( |
112 | 112 |
license="Apache License, Version 2.0",
|
113 | 113 |
description="A remote execution service",
|
114 | 114 |
packages=find_packages(),
|
115 |
+ python_requires='>= 3.5.3', # janus requirement
|
|
115 | 116 |
install_requires=[
|
116 |
- 'protobuf',
|
|
117 |
- 'grpcio',
|
|
118 |
- 'Click',
|
|
119 |
- 'PyYAML',
|
|
120 | 117 |
'boto3 < 1.8.0',
|
121 | 118 |
'botocore < 1.11.0',
|
119 |
+ 'click',
|
|
120 |
+ 'grpcio',
|
|
121 |
+ 'janus',
|
|
122 |
+ 'protobuf',
|
|
123 |
+ 'pyyaml',
|
|
122 | 124 |
],
|
123 | 125 |
entry_points={
|
124 | 126 |
'console_scripts': [
|
... | ... | @@ -21,8 +21,8 @@ import tempfile |
21 | 21 |
|
22 | 22 |
import boto3
|
23 | 23 |
import grpc
|
24 |
-import pytest
|
|
25 | 24 |
from moto import mock_s3
|
25 |
+import pytest
|
|
26 | 26 |
|
27 | 27 |
from buildgrid._protos.build.bazel.remote.execution.v2 import remote_execution_pb2
|
28 | 28 |
from buildgrid.server.cas.storage.remote import RemoteStorage
|
1 |
+# Copyright (C) 2018 Bloomberg LP
|
|
2 |
+#
|
|
3 |
+# Licensed under the Apache License, Version 2.0 (the "License");
|
|
4 |
+# you may not use this file except in compliance with the License.
|
|
5 |
+# You may obtain a copy of the License at
|
|
6 |
+#
|
|
7 |
+# <http://www.apache.org/licenses/LICENSE-2.0>
|
|
8 |
+#
|
|
9 |
+# Unless required by applicable law or agreed to in writing, software
|
|
10 |
+# distributed under the License is distributed on an "AS IS" BASIS,
|
|
11 |
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
12 |
+# See the License for the specific language governing permissions and
|
|
13 |
+# limitations under the License.
|
|
14 |
+ |
|
15 |
+# pylint: disable=redefined-outer-name
|
|
16 |
+ |
|
17 |
+ |
|
18 |
+import grpc
|
|
19 |
+import pytest
|
|
20 |
+ |
|
21 |
+from buildgrid._protos.build.bazel.remote.execution.v2 import remote_execution_pb2
|
|
22 |
+from buildgrid.client.capabilities import CapabilitiesInterface
|
|
23 |
+from buildgrid.server.controller import ExecutionController
|
|
24 |
+from buildgrid.server.actioncache.storage import ActionCache
|
|
25 |
+from buildgrid.server.cas.instance import ContentAddressableStorageInstance
|
|
26 |
+from buildgrid.server.cas.storage.lru_memory_cache import LRUMemoryCache
|
|
27 |
+ |
|
28 |
+from ..utils.utils import run_in_subprocess
|
|
29 |
+from ..utils.capabilities import serve_capabilities_service
|
|
30 |
+ |
|
31 |
+ |
|
32 |
+INSTANCES = ['', 'instance']
|
|
33 |
+ |
|
34 |
+ |
|
35 |
+# Use subprocess to avoid creation of gRPC threads in main process
|
|
36 |
+# See https://github.com/grpc/grpc/blob/master/doc/fork_support.md
|
|
37 |
+# Multiprocessing uses pickle which protobufs don't work with
|
|
38 |
+# Workaround wrapper to send messages as strings
|
|
39 |
+class ServerInterface:
|
|
40 |
+ |
|
41 |
+ def __init__(self, remote):
|
|
42 |
+ self.__remote = remote
|
|
43 |
+ |
|
44 |
+ def get_capabilities(self, instance_name):
|
|
45 |
+ |
|
46 |
+ def __get_capabilities(queue, remote, instance_name):
|
|
47 |
+ interface = CapabilitiesInterface(grpc.insecure_channel(remote))
|
|
48 |
+ |
|
49 |
+ result = interface.get_capabilities(instance_name)
|
|
50 |
+ queue.put(result.SerializeToString())
|
|
51 |
+ |
|
52 |
+ result = run_in_subprocess(__get_capabilities,
|
|
53 |
+ self.__remote, instance_name)
|
|
54 |
+ |
|
55 |
+ capabilities = remote_execution_pb2.ServerCapabilities()
|
|
56 |
+ capabilities.ParseFromString(result)
|
|
57 |
+ return capabilities
|
|
58 |
+ |
|
59 |
+ |
|
60 |
+@pytest.mark.parametrize('instance', INSTANCES)
|
|
61 |
+def test_execution_not_available_capabilities(instance):
|
|
62 |
+ with serve_capabilities_service([instance]) as server:
|
|
63 |
+ server_interface = ServerInterface(server.remote)
|
|
64 |
+ response = server_interface.get_capabilities(instance)
|
|
65 |
+ |
|
66 |
+ assert not response.execution_capabilities.exec_enabled
|
|
67 |
+ |
|
68 |
+ |
|
69 |
+@pytest.mark.parametrize('instance', INSTANCES)
|
|
70 |
+def test_execution_available_capabilities(instance):
|
|
71 |
+ controller = ExecutionController()
|
|
72 |
+ |
|
73 |
+ with serve_capabilities_service([instance],
|
|
74 |
+ execution_instance=controller.execution_instance) as server:
|
|
75 |
+ server_interface = ServerInterface(server.remote)
|
|
76 |
+ response = server_interface.get_capabilities(instance)
|
|
77 |
+ |
|
78 |
+ assert response.execution_capabilities.exec_enabled
|
|
79 |
+ assert response.execution_capabilities.digest_function
|
|
80 |
+ |
|
81 |
+ |
|
82 |
+@pytest.mark.parametrize('instance', INSTANCES)
|
|
83 |
+def test_action_cache_allow_updates_capabilities(instance):
|
|
84 |
+ storage = LRUMemoryCache(limit=256)
|
|
85 |
+ action_cache = ActionCache(storage, max_cached_refs=256, allow_updates=True)
|
|
86 |
+ |
|
87 |
+ with serve_capabilities_service([instance],
|
|
88 |
+ action_cache_instance=action_cache) as server:
|
|
89 |
+ server_interface = ServerInterface(server.remote)
|
|
90 |
+ response = server_interface.get_capabilities(instance)
|
|
91 |
+ |
|
92 |
+ assert response.cache_capabilities.action_cache_update_capabilities.update_enabled
|
|
93 |
+ |
|
94 |
+ |
|
95 |
+@pytest.mark.parametrize('instance', INSTANCES)
|
|
96 |
+def test_action_cache_not_allow_updates_capabilities(instance):
|
|
97 |
+ storage = LRUMemoryCache(limit=256)
|
|
98 |
+ action_cache = ActionCache(storage, max_cached_refs=256, allow_updates=False)
|
|
99 |
+ |
|
100 |
+ with serve_capabilities_service([instance],
|
|
101 |
+ action_cache_instance=action_cache) as server:
|
|
102 |
+ server_interface = ServerInterface(server.remote)
|
|
103 |
+ response = server_interface.get_capabilities(instance)
|
|
104 |
+ |
|
105 |
+ assert not response.cache_capabilities.action_cache_update_capabilities.update_enabled
|
|
106 |
+ |
|
107 |
+ |
|
108 |
+@pytest.mark.parametrize('instance', INSTANCES)
|
|
109 |
+def test_cas_capabilities(instance):
|
|
110 |
+ cas = ContentAddressableStorageInstance(None)
|
|
111 |
+ |
|
112 |
+ with serve_capabilities_service([instance],
|
|
113 |
+ cas_instance=cas) as server:
|
|
114 |
+ server_interface = ServerInterface(server.remote)
|
|
115 |
+ response = server_interface.get_capabilities(instance)
|
|
116 |
+ |
|
117 |
+ assert len(response.cache_capabilities.digest_function) == 1
|
|
118 |
+ assert response.cache_capabilities.digest_function[0]
|
|
119 |
+ assert response.cache_capabilities.symlink_absolute_path_strategy
|
|
120 |
+ assert response.cache_capabilities.max_batch_total_size_bytes
|
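The pickle workaround mentioned in the comments above boils down to shipping protobuf messages across the process boundary as serialized bytes. A minimal sketch of the pattern, with illustrative names that are not part of the test suite:

    import multiprocessing

    from buildgrid._protos.build.bazel.remote.execution.v2 import remote_execution_pb2

    def child(queue):
        capabilities = remote_execution_pb2.ServerCapabilities()
        queue.put(capabilities.SerializeToString())  # plain bytes pickle fine

    if __name__ == '__main__':
        queue = multiprocessing.Queue()
        process = multiprocessing.Process(target=child, args=(queue,))
        process.start()

        result = remote_execution_pb2.ServerCapabilities()
        result.ParseFromString(queue.get(timeout=1))
        process.join()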
1 |
+# Copyright (C) 2018 Bloomberg LP
|
|
2 |
+#
|
|
3 |
+# Licensed under the Apache License, Version 2.0 (the "License");
|
|
4 |
+# you may not use this file except in compliance with the License.
|
|
5 |
+# You may obtain a copy of the License at
|
|
6 |
+#
|
|
7 |
+# <http://www.apache.org/licenses/LICENSE-2.0>
|
|
8 |
+#
|
|
9 |
+# Unless required by applicable law or agreed to in writing, software
|
|
10 |
+# distributed under the License is distributed on an "AS IS" BASIS,
|
|
11 |
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
12 |
+# See the License for the specific language governing permissions and
|
|
13 |
+# limitations under the License.
|
|
14 |
+ |
|
15 |
+ |
|
16 |
+from concurrent import futures
|
|
17 |
+from contextlib import contextmanager
|
|
18 |
+import multiprocessing
|
|
19 |
+import os
|
|
20 |
+import signal
|
|
21 |
+ |
|
22 |
+import grpc
|
|
23 |
+import pytest_cov
|
|
24 |
+ |
|
25 |
+from buildgrid.server.capabilities.service import CapabilitiesService
|
|
26 |
+from buildgrid.server.capabilities.instance import CapabilitiesInstance
|
|
27 |
+ |
|
28 |
+ |
|
29 |
+@contextmanager
|
|
30 |
+def serve_capabilities_service(instances,
|
|
31 |
+ cas_instance=None,
|
|
32 |
+ action_cache_instance=None,
|
|
33 |
+ execution_instance=None):
|
|
34 |
+ server = Server(instances,
|
|
35 |
+ cas_instance,
|
|
36 |
+ action_cache_instance,
|
|
37 |
+ execution_instance)
|
|
38 |
+ try:
|
|
39 |
+ yield server
|
|
40 |
+ finally:
|
|
41 |
+ server.quit()
|
|
42 |
+ |
|
43 |
+ |
|
44 |
+class Server:
|
|
45 |
+ |
|
46 |
+ def __init__(self, instances,
|
|
47 |
+ cas_instance=None,
|
|
48 |
+ action_cache_instance=None,
|
|
49 |
+ execution_instance=None):
|
|
50 |
+ self.instances = instances
|
|
51 |
+ |
|
52 |
+ self.__queue = multiprocessing.Queue()
|
|
53 |
+ self.__process = multiprocessing.Process(
|
|
54 |
+ target=Server.serve,
|
|
55 |
+ args=(self.__queue, self.instances, cas_instance, action_cache_instance, execution_instance))
|
|
56 |
+ self.__process.start()
|
|
57 |
+ |
|
58 |
+ self.port = self.__queue.get(timeout=1)
|
|
59 |
+ self.remote = 'localhost:{}'.format(self.port)
|
|
60 |
+ |
|
61 |
+ @staticmethod
|
|
62 |
+ def serve(queue, instances, cas_instance, action_cache_instance, execution_instance):
|
|
63 |
+ pytest_cov.embed.cleanup_on_sigterm()
|
|
64 |
+ |
|
65 |
+ # Use max_workers default from Python 3.5+
|
|
66 |
+ max_workers = (os.cpu_count() or 1) * 5
|
|
67 |
+ server = grpc.server(futures.ThreadPoolExecutor(max_workers))
|
|
68 |
+ port = server.add_insecure_port('localhost:0')
|
|
69 |
+ |
|
70 |
+ capabilities_service = CapabilitiesService(server)
|
|
71 |
+ for name in instances:
|
|
72 |
+ capabilities_instance = CapabilitiesInstance(cas_instance, action_cache_instance, execution_instance)
|
|
73 |
+ capabilities_service.add_instance(name, capabilities_instance)
|
|
74 |
+ |
|
75 |
+ server.start()
|
|
76 |
+ queue.put(port)
|
|
77 |
+ signal.pause()
|
|
78 |
+ |
|
79 |
+ def quit(self):
|
|
80 |
+ if self.__process:
|
|
81 |
+ self.__process.terminate()
|
|
82 |
+ self.__process.join()
|
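Putting the helper to use, a test spins the capabilities service up for the duration of a with-block and is handed the address to dial; the absolute import path here is illustrative, the integration tests above use a relative one:

    from tests.utils.capabilities import serve_capabilities_service

    with serve_capabilities_service(['main']) as server:
        print(server.remote)  # e.g. 'localhost:54321', ready for grpc.insecure_channel()
    # On exit, the context manager terminates and joins the server process.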