radicle-dev / infra

Infrastructure
Follow-up of #15
The nice thing about backing docker volumes by a CoW filesystem is that we could incrementally store snapshots on shared storage (e.g. Google Cloud Storage), which would help in farming out build agents. The tricky part is to ensure ancestry for concurrent builds when the storage end is not a ZFS pool:

Let M be the master cache. Let B_1 and B_2 be concurrent builds of the same branch, each inheriting M. Their cache states result in C_1 and C_2, respectively. When B_3 is triggered, it needs to choose one of C_1, C_2, or M. The problem is to find the optimal choice with respect to cache invalidation.
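One conceivable heuristic, sketched below, is to pick whichever candidate carries the most recent snapshot; the dataset names are made up, and this ignores invalidation depth entirely:

# Pick the candidate cache dataset with the newest snapshot (epoch seconds).
zfs list -H -p -t snapshot -o name,creation \
    -r tank/cache/master tank/cache/c1 tank/cache/c2 \
  | sort -t$'\t' -k2 -nr \
  | head -n1 \
  | cut -f1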
For easy deployment and rollbacks we should provide a debian package for the CI code.
We still have the configuration of some domains as terraform files in the private monadic-xyz/infrastructure repo. We should move that here.
The command hook needs to pass variables explicitly to docker run. There's no way to distinguish which variables (of env) were set in the pipeline, so perhaps require some prefix?
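A minimal sketch of such prefix filtering, assuming we settle on a CI_ prefix (both the prefix and the hook wiring are assumptions):

# Forward only pipeline variables carrying the agreed-upon prefix.
env_args=()
while IFS='=' read -r name _; do
  case "$name" in
    CI_*) env_args+=(-e "$name") ;;  # docker picks up the value from our env
  esac
done < <(env)
docker run "${env_args[@]}" "$IMAGE" "$@"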
The distro version of the buildkite agent instances we run on GCP (Ubuntu 20.04) does not match some of the package sources we are using in cloud-init.yaml. (These include debian buster and ubuntu bionic.)
All releases should be cut from the #63 box, and appropriately signed.
signify is a good tool to avoid GPG and the associated key management headaches, but GPG may still be needed for some use cases (e.g. deb packages).
Access to storage platforms for the produced artifacts also needs to be provided for the #63 box, with appropriate documentation and credential rotation procedures. This is part of this issue.
Depends on #63
The stateful ffnet validators should store their chain data on the persistent volume so that they don’t need to resync when they are restarted. The persistent volume is already mounted, we just need to pass it with the --data flag to the node executable.
To make it easier for developers and users to identify and reference artifacts (both binaries and docker images), I propose we use tag names to identify them when tags are built.

If a tag foo is built, the docker image should be tagged ${STEP_DOCKER_IMAGE}:${BUILDKITE_TAG}. Similarly, artifacts should be published to gs://builds.radicle.xyz/${BUILDKITE_PIPELINE_SLUG}/tags/${BUILDKITE_TAG}.
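A rough sketch of what that build step could look like, assuming the build leaves a local image named ${STEP_DOCKER_IMAGE} and the binaries in an artifacts/ directory (both assumptions):

if [ -n "${BUILDKITE_TAG}" ]; then
  # Re-tag and push the image under the git tag name.
  docker tag "${STEP_DOCKER_IMAGE}" "${STEP_DOCKER_IMAGE}:${BUILDKITE_TAG}"
  docker push "${STEP_DOCKER_IMAGE}:${BUILDKITE_TAG}"
  # Publish the binaries next to it in the bucket.
  gsutil cp -r artifacts/ \
    "gs://builds.radicle.xyz/${BUILDKITE_PIPELINE_SLUG}/tags/${BUILDKITE_TAG}/"
fi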
We need a historical view on the data from telemetry.polkadot.io. Possible solutions:
Useful for "master is always green" and caching policies (cf. #9).
Need to find out which of the bors variants is the least hassle to operate and exposes the smallest attack surface.
The terraform code for the registry development infrastructure lives in a private repo at the moment. It should be moved here. We should carefully review the repo to make sure no sensitive data is exposed.
We don't add newly provisioned agents to the production pool by default. When provisioning from a known-good version, it should be possible to set instance metadata which causes the agent tags to be set such that the agent joins the production pool.
Depends on #54
When a job is cancelled, the container that is running the job is not stopped. This consumes resources on the machine and blocks the cache volume from being used by the next job for this branch. The latter results in a docker: Error response from daemon: File exists (os error 17) error.
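A possible mitigation, sketched here under the assumption that our hooks label containers with the job id (they currently may not):

# pre-exit hook: Buildkite runs this even when the job is cancelled.
containers=$(docker ps -q --filter "label=job=${BUILDKITE_JOB_ID}")
if [ -n "$containers" ]; then
  docker stop $containers
fi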
To run and test the latest registry nodes we are currently updating the node version on the devnet for every master build of the registry. This is implemented through a build step that sends an update to the Kubernetes cluster with kubectl. There are two issues with this approach.

To address these issues I propose we move the deployment logic to the devnet cluster. Concretely, we run a Kubernetes service or job that updates the image when a new master build is available. I'm not aware of a simple tool that allows us to do that, so I'd propose implementing a simple Kubernetes job that pulls the git repository and updates the images when a new commit is available.
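A minimal sketch of such an updater loop; the deployment, container, and image names are made up:

# Poll master and roll the deployment to the image built from its tip.
while true; do
  sha=$(git ls-remote https://github.com/radicle-dev/radicle-registry.git \
          refs/heads/master | cut -f1)
  kubectl set image deployment/devnet-validator \
    node="gcr.io/radicle-registry/node:${sha}"
  sleep 60
done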
Turns out ZFS is picky about which characters are allowed in volume names (the allowed set is narrower than for a valid POSIX path), so it's better to just use a hash of the supplied name. We need to figure out whether that confuses docker.
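Something along these lines, with a made-up pool name:

# Derive a ZFS-safe dataset name from the arbitrary volume name.
hashed=$(printf '%s' "$volume_name" | sha256sum | cut -c1-32)
zfs create "tank/volumes/${hashed}"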
We’re not removing images built by a step when STEP_DOCKER_FILE is set. We should extend the docker-volume-prune service to also remove these images.
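One way to do this, assuming we start labelling the step images at build time (the label name is made up):

# At build time:
docker build --label step_image=true -f "$STEP_DOCKER_FILE" .
# In the prune service, remove all unused images carrying that label:
docker image prune --all --force --filter "label=step_image=true"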
The minimum should be the number of validator peers in the cluster. We can't rely on miner nodes being up.
Apart from better "rootless" support, this would also enable running pods.
It seems like the kata-containers runtime would also be supported, but we couldn't find anything in the official docs.
/cc @xla @geigerzaehler
To make it easier to use the devnet with upstream and the registry CLI, we should give the RPC node for the devnet a DNS entry. We are already doing this for the ffnet.
Cloud agents do this thanks to Google provisioning on the official VM images. We may want to use the same on bare metal, or, better yet, do something more secure, e.g. CA-signed keys.
As a developer using the CI I want to add secrets to my codebase so that a job can take a restricted action (e.g. deploy an update).
To accomplish this we can use sops. A developer can provide a sops file containing encrypted secrets in .buildkite/secrets.yaml. In the first stage the file is encrypted using a PGP key whose private key resides on the agent. On master builds the secrets are made available to the job using
sops --output-type dotenv -d /build/.buildkite/secrets.yaml
To improve the security we can use GCP KMS to store the private key to decrypt the secrets.
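For illustration, the job environment could then be populated like this (a sketch; it assumes simple KEY=value secrets without shell metacharacters):

# Decrypt to dotenv format and export each entry into the job environment.
set -a
. <(sops --output-type dotenv -d /build/.buildkite/secrets.yaml)
set +a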
The official rust docker images install cargo in /usr/local, with no registry cache. Thus, every build spends a lot of time in "Updating crates.io index".

We could pre-populate it, e.g. when installing sccache, but we would then have to make /usr/local/cargo/registry world-writable, as CI containers are not allowed to run as root. Standardising on a proper user account inside the container is a non-solution, as it would have to match a host user for access to bind mounts. Potentially, the registry could be cached on the host, and bind mounted into the guest.
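The host-side variant could look roughly like this (the host path is made up; the official images set CARGO_HOME=/usr/local/cargo):

# Share the crates.io index and crate sources from the host.
docker run \
  -v /var/cache/cargo-registry:/usr/local/cargo/registry \
  -v "$PWD":/src -w /src \
  rust:latest cargo build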
The nodes are being restarted automatically, so downtimes are the norm. We need to switch alerting from one based on Prometheus' up metric to one based on Stackdriver's metrics, which can tell a crash from a restart.
Notes from @geigerzaehler:
One issue we have with the current setup is that we get false alerts for nodes being down when miners are pre-empted. To check miner availability we need to distinguish between a node being down because its pod is rescheduled due to a VM being pre-empted, and a node being down because it has crashed.

I found that kube-state-metrics exposes a metric for the pod phase. We could use the "Failed" status of that metric to determine whether a node crashed.
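If we go that route, the alert condition could be as simple as the following PromQL expression (untested):

kube_pod_status_phase{phase="Failed"} > 0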
For security reasons, the following may be considered:
Currently, new branch builds use a cache volume whose origin is the agent-scoped master cache volume instead of the shared master cache volume, even if SHARED_MASTER_CACHE is set to true. A workaround is to set SHARED_MASTER_CACHE=true for the first buildkite pipeline upload step.

The reason for this behavior is that the cache volumes are created only once, but in whichever build step runs first, even the special buildkite pipeline upload step. To address this properly we should move the volume creation to a later stage, where it is skipped when we do the pipeline upload.
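As a sketch, the hook that creates the volumes could bail out early on the upload step, relying on BUILDKITE_COMMAND (which the agent exports):

# Skip cache volume creation for the pipeline upload step.
case "${BUILDKITE_COMMAND}" in
  *"buildkite-agent pipeline upload"*) exit 0 ;;
esac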
To speed up builds we would like to cache CARGO_TARGET_DIR (which is usually ./target) between builds of radicle-registry. Caching the target directory gives a significant speed-up over just using sccache, since only the project code is recompiled. (I’ll try to provide hard numbers later.)

The target directory contains roughly 4GB for radicle-registry. So for sensible caching we’ll need roughly 8GB of space on the /cache volume. Inside the project we’ll also need to think of a strategy to clean up the target directory.
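The wiring itself is small; a sketch, assuming /cache is the mounted cache volume:

# Point cargo's build output at the per-branch cache volume.
export CARGO_TARGET_DIR=/cache/target
cargo build --all-targets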
We have Cloudflare as the nameserver of radicle.run, where the DNS records are managed. We should add the Cloudflare configuration here.
This allows monitoring for DDoS attacks and for genuinely invalid actors. Depends on paritytech/substrate#6006
deb packages make provisioning our own infrastructure tools much more convenient.

However, using standalone deb files can't take advantage of apt's dependency resolution, which makes provisioning code more brittle. We've been using bintray for this, but it's a bit arcane, and frequently blocks our account for storage quota breaches. packagecloud is an alternative, but the pricing of both doesn't seem very attractive for our use case (this is not about distributing our end-user products).
Managing a debian repo is not too complicated, provided there is a single non-volatile machine running reprepro and syncing the state to cloud storage. Given #63 and #64, we could easily set this up.
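A sketch of the publish path, assuming a conventional reprepro layout under /srv/debian and a mirror bucket (all names made up):

# Add a package to the repo and mirror the result to cloud storage.
reprepro -b /srv/debian includedeb stable ci-tool_1.0_amd64.deb
gsutil -m rsync -r /srv/debian gs://apt.radicle.xyz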
Sometimes build cache volumes are created without the build_cache label. This results in those volumes not being cleaned up by docker-volume-prune.sh, and thus we run out of disk space on the machine.
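Wherever the volumes get created, the label should be passed unconditionally; a sketch, assuming the hook goes through docker volume create with the zockervols driver:

docker volume create --driver zockervols --label build_cache "$CACHE_VOLUME_NAME"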
Ze bash, it makes me cringe.
This allows monitoring for DDoS attacks and for genuinely invalid actors. Depends on radicle-dev/radicle-registry#463
At the moment the cache volume mounted at /cache is not shared between agents. This means that a build running on agent X cannot utilize the cache created by agent Y.

A possible solution would be to use Google Cloud Storage for the cache. This worked well for oscoin-hs, see https://github.com/oscoin/oscoin/blob/master/scripts/ci-cache.sh. The agent could just provide GCS credentials for privileged builds so these builds can use gsutil, or the agent would take care of actually uploading the cache.
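Mirroring the oscoin approach, the transfer could be as plain as this (bucket and key names are made up):

# Restore the cache at job start; upload it again at job end.
gsutil cp "gs://radicle-ci-cache/${BUILDKITE_PIPELINE_SLUG}.tar.gz" - \
  | tar -xz -C /cache || true
tar -cz -C /cache . \
  | gsutil cp - "gs://radicle-ci-cache/${BUILDKITE_PIPELINE_SLUG}.tar.gz"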
The nodes get a negative chain-length change after being restarted. A negative change shouldn't be a reason to raise the alarm.
We mount the token into the build containers of "trusted" builds, but it's a bit unclear what attack surface this opens. We should at least rotate the tokens from time to time, and maybe ask Buildkite to give us better scoping.
podman doesn't have pluggable volume drivers, so we need to create ZFS filesystems before invoking podman, and use bind mounts instead of volumes. The ZFS-specifics of zockervols should be reusable.
Depends on #54
As we can't control concurrency from the agent, the cache volumes associated with default branches are scoped per agent instance. This should normally converge quickly, but builds with high cache churn (as is apparently the case for radicle-registry) suffer from significant slowdowns.

Since concurrency can be controlled from pipeline.yaml, it is conceivable to offer an opt-in, "use at your own risk" environment variable which forces non-scoped cache volumes for "trusted" builds. If in the future we farm out the agents to multiple machines, the cache snapshots can also be obtained over the network.
The registry team wants to be alerted when the FFnet or the underlying infrastructure becomes unhealthy. For this we need to decide the following:
This allows monitoring for DDoS attacks and for genuinely invalid actors. Depends on radicle-dev/radicle-registry#407
The RPC node for the ffnet is not stateful. This means that on re-deployment it needs to sync the whole chain. This process is CPU intensive. At the moment we have configured a limit of 0.2 CPUs for the RPC deployment to use. We should remove this limit so that the RPC node can use more CPU if available and allow faster syncing.
It ain't designed to always be convergent, but it's safe to update the static configuration. There should be a script to do that.
Depends on #13
The script to prune old cache volumes is not working. Cache volumes have accumulated on our machine and the systemd service prints errors.
This is basically what registry is doing from within the build scripts. Under the assumption that caches grow indefinitely due to dependency churn, we could also have the agent manage this.
The only thing to consider is the dependency tree from master -> branch builds, which might prevent a recursive zfs destroy when child datasets are in use. Perhaps this should be a timer which runs at non-peak times.
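The timer's payload could be as simple as this sketch (the dataset layout is an assumption; zfs destroy fails, and we skip, while a child is busy):

# Off-peak cache cleanup: destroy branch datasets recursively where possible.
for ds in $(zfs list -H -o name -d 1 tank/cache | tail -n +2); do
  zfs destroy -r "$ds" || echo "in use, skipping: $ds"
done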
Sometimes nodes get stuck while syncing when started for the first time. We should create an alert for that case.
This should be an airgapped box physically in the HQ. All artifact-producing build steps should run there.
At the moment the ffnet nodes that we run use randomly generated names to report to telemetry.polkadot.io. This makes it hard to distinguish them from nodes run by other people. I propose we use monadic-{random_number} as the name for our nodes.
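Assuming the node binary follows substrate's CLI conventions and accepts a --name flag (an assumption, as is the binary name), this is a one-line change at start-up:

radicle-registry-node --name "monadic-${RANDOM}"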
This should be monitored the same as mining too slow, except it should trigger only if is_major_synced is set. The margin of error should be smaller for averages over larger time windows.