infra's Issues

Put cache snapshots on shared storage

Followup of #15

The nice thing about backing docker volumes by a CoW filesystem is that we could incrementally store snapshots on shared storage (e.g. Google Cloud Storage), which would help with farming out build agents. The tricky part is ensuring ancestry for concurrent builds when the storage end is not a ZFS pool:

Let M be the master cache.
Let B_1 and B_2 be concurrent builds of the same branch, each inheriting M. Their cache state results in C_1 and C_2, respectively.

When B_3 is triggered, it needs to choose one of C_1, C_2, or M. The problem is to find the optimal choice with respect to cache invalidation.
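One way to preserve ancestry on a backend that is not a ZFS pool is to encode the parent in each snapshot's object name, so a scheduler can walk the lineage when picking a base for B_3. A minimal sketch; the naming scheme is an assumption, not something the issue specifies:

```shell
# Encode the parent cache in the snapshot's name so lineage survives
# a plain object store like GCS (the scheme is hypothetical).
snapshot_name() {
  local build="$1" parent="$2"
  printf '%s__from__%s' "$build" "$parent"
}

# Recover the parent a snapshot was derived from.
parent_of() {
  printf '%s' "${1##*__from__}"
}
```

With the lineage recoverable, B_3 could prefer the most recent child of M and fall back to M itself when no child exists.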

Provide debian package

For easy deployment and rollbacks, we should provide a Debian package for the CI code.

metal: Devise release key management

All releases should be cut from the #63 box, and appropriately signed.

signify is a good tool to avoid GPG and the associated key management headaches, but GPG may still be needed for some use cases (e.g. deb packages).

Access to storage platforms for the produced artifacts also needs to be provided for the #63 box, with appropriate documentation and credential rotation procedures. This is part of this issue.

Depends on #63

Validators should store chain data on persistent volume

The stateful ffnet validators should store their chain data on the persistent volume so that they don’t need to resync when they are restarted. The persistent volume is already mounted; we just need to pass it via the --data flag to the node executable.

Use sops/git-crypt to encrypt secrets in codebase

We want two things that sops provides:

  • Everybody can see which secret environment variables are exported without seeing the plaintext values
  • A group of separate individuals can read and edit the secrets

git-crypt is a simpler alternative but it does not satisfy the first point.

Use tag names to easily identify build artifacts

To make it easier for developers and users to identify and reference artifacts (both binaries and docker images) I propose we use tag names to identify them when tags are built.

If a tag foo is built, the docker image should be tagged ${STEP_DOCKER_IMAGE}:${BUILDKITE_TAG}. Similarly, artifacts should be published to gs://builds.radicle.xyz/${BUILDKITE_PIPELINE_SLUG}/tags/${BUILDKITE_TAG}.
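A sketch of the proposal: the helper below just assembles the GCS path given above, and the commented commands show how a tag build step might use it (the docker/gsutil invocations are illustrative):

```shell
# Destination for a tag's artifacts, per the proposed scheme.
artifact_dest() {
  local pipeline="$1" tag="$2"
  printf 'gs://builds.radicle.xyz/%s/tags/%s' "$pipeline" "$tag"
}

# In the build step for a tag, roughly:
#   docker tag "$IMAGE" "${STEP_DOCKER_IMAGE}:${BUILDKITE_TAG}"
#   gsutil cp -r artifacts/ \
#     "$(artifact_dest "$BUILDKITE_PIPELINE_SLUG" "$BUILDKITE_TAG")/"
```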

Provide a build bot

Useful for "master is always green" and caching policies (cf. #9).

Need to find out which of the bors variants is the least hassle to operate, and exposes the least attack surface.

Integrate registry infrastructure code

The terraform code for the registry development infrastructure currently lives in a private repo. It should be moved here. We should carefully review the repo to make sure no sensitive data is exposed.

cloud: Inject agent tags via instance metadata

We don't add newly provisioned agents to the production pool by default. When provisioning from a known-good version, it should be possible to set instance metadata that makes the agent's tags include it in the production pool.

Depends on #54

Cancelling job does not stop container

When a job is cancelled, the container running the job is not stopped. This consumes resources on the machine and blocks the cache volume from being used by the next job for this branch. The latter results in a docker: Error response from daemon: File exists (os error 17) error.

Improve continuous deployment for registry nodes

To run and test the latest registry nodes we are currently updating the node version on the devnet for every master build of the registry. This is implemented through a build step that sends an update to the Kubernetes cluster with kubectl. There are two issues with this approach:

  • The build agent needs credentials with permission to update the image. This is a security risk
  • The build step in the registry repo needs to be aligned with the cluster setup here. I.e. if we add a deployment of nodes we also need to update the deployment code in the registry repo.

To address this issue I propose we move the deployment logic to the devnet cluster. Concretely we run a Kubernetes service or job that updates the image when a new master build is available. I’m not aware of a simple tool that allows us to do that. I’d propose implementing a simple Kubernetes job that pulls the git repository and updates the images when a new commit is available.
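A minimal sketch of such an in-cluster updater, assuming a deployment named registry-node and images tagged by commit SHA (both names are hypothetical):

```shell
# True when a new, non-empty master SHA differs from the deployed one.
needs_update() {
  [ -n "$2" ] && [ "$2" != "$1" ]
}

# The job body would then be, roughly (URL and names are assumptions):
#   while sleep 300; do
#     sha=$(git ls-remote "$REPO" refs/heads/master | cut -f1)
#     if needs_update "$deployed" "$sha"; then
#       kubectl set image deployment/registry-node node="$IMAGE:$sha"
#       deployed="$sha"
#     fi
#   done
```

Keeping the kubectl credentials inside the cluster addresses the first bullet; the build agents no longer need them.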

Hash volume name

It turns out ZFS is picky about which characters are allowed in volume names (the allowed set is narrower than for a valid POSIX path) -- better to just use a hash of the supplied name. We need to figure out whether that confuses docker.
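A sketch of the hashing approach: sha256 hex output contains only [0-9a-f], which ZFS accepts (the 16-character truncation is an arbitrary choice):

```shell
# Derive a ZFS-safe dataset name from an arbitrary volume name.
zfs_safe_name() {
  printf '%s' "$1" | sha256sum | cut -c1-16
}
```

If docker needs the original name for display, a name-to-hash mapping could be kept alongside the datasets.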

Prune images built by steps

We’re not removing images built by a step when STEP_DOCKER_FILE is set. We should extend the docker-volume-prune service to also remove these images.

Add SSH keys on bare-metal agent(s)

Cloud agents do this thanks to google provisioning on official VM images. We may want to use the same on bare-metal, or, better yet, do something more secure. E.g. CA-signed keys.

Self-service secrets

As a developer using the CI I want to add secrets to my codebase so that a job can take a restricted action (e.g. deploy an update).

To accomplish this we can use sops. A developer can provide a sops file containing encrypted secrets in .buildkite/secrets.yaml. In the first stage the file is encrypted using a PGP key with the private key residing on the agent. On master builds the secrets are made available to the job using

sops --output-type dotenv -d /build/.buildkite/secrets.yaml

To improve the security we can use GCP KMS to store the private key to decrypt the secrets.
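A sketch of how a job could consume the dotenv output; the `set -a` export trick is an assumption about the agent hook, not something the issue specifies:

```shell
# Export every VAR=value line from a dotenv-formatted string.
load_dotenv() {
  set -a
  eval "$1"
  set +a
}

# On a master build the agent hook would then run, roughly:
#   load_dotenv "$(sops --output-type dotenv -d /build/.buildkite/secrets.yaml)"
```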

Pre-populate local crates.io registry cache

Problem

The official rust docker images install cargo in /usr/local, with no registry cache. Thus, every build spends a lot of time in "Updating crates.io index".

Solutions

We could pre-populate it, e.g. when installing sccache, but we would then have to make /usr/local/cargo/registry world-writable, as CI containers are not allowed to run as root. Standardising on a proper user account inside the container is a non-solution, as it has to match a host user for access to bind mounts. Alternatively, the registry could be cached on the host and bind-mounted into the guest.
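A sketch of the host-side caching variant (paths are assumptions): keep the registry on the host's cache volume and bind-mount it into the container, so no image path needs to be world-writable.

```shell
# Flag for bind-mounting the host-side registry cache into the guest.
registry_mount() {
  printf -- '-v %s:/usr/local/cargo/registry' "$1"
}

# e.g., in the step that runs the build container:
#   mkdir -p /cache/cargo-registry
#   docker run --rm $(registry_mount /cache/cargo-registry) \
#     -u "$(id -u):$(id -g)" rust:latest cargo fetch
```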

Fix nodes being down alerts spam

The nodes are being restarted automatically, so downtimes are the norm. We need to switch alerting from Prometheus' up metric to one of Stackdriver's metrics, which can tell a crash from a restart.

Notes from @geigerzaehler:

One issue we have with the current setup is that we get false alerts for nodes being down when miners are pre-empted. To check miner availability we need to distinguish between a node being down because its pod is rescheduled due to a VM being pre-empted, and a node being down because it has crashed.

I found that kube-state-metrics exposes a metric for the pod phase. We could use the “Failed” status of that metric to determine whether a node crashed.

Shared master cache requires env variable for pipeline upload step

Currently, new branch builds use a cache volume whose origin is the agent-scoped master cache volume instead of the shared master cache volume, even if SHARED_MASTER_CACHE is set to true. A workaround is to set SHARED_MASTER_CACHE=true for the first buildkite pipeline upload step.

The reason for this behavior is that the cache volumes are created only once, by whichever build step runs first, including the special buildkite pipeline upload step. To address this properly we should move volume creation to a later stage and skip it for the pipeline upload.

More disk space to cache CARGO_TARGET_DIR

To speed up builds we would like to cache CARGO_TARGET_DIR (which is usually ./target) between builds of radicle-registry. Caching the target directory gives a significant speed-up over just using sccache, since only the project code is recompiled. (I’ll try to provide hard numbers later.)

The target directory contains roughly 4GB for radicle-registry. So for sensible caching we’ll need roughly 8GB of space on the /cache volume. Inside the project we’ll also need to think of a strategy to clean up the target directory.
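A sketch of a per-project layout on the cache volume; the path convention is an assumption:

```shell
# Per-project cached target directory under the /cache volume.
target_dir() {
  printf '/cache/cargo-target/%s' "$1"
}

# In the build step:
#   export CARGO_TARGET_DIR="$(target_dir radicle-registry)"
#   cargo build --release
```

Namespacing by project keeps one project's churn from evicting another's cache and makes per-project cleanup straightforward.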

debian package repository

deb packages make provisioning our own infrastructure tools much more convenient.

However, standalone deb files can't take advantage of apt's dependency resolution, which makes provisioning code more brittle. We've been using bintray for this, but it's a bit arcane and frequently blocks our account for breaching the storage quota. packagecloud is an alternative, but neither's pricing seems very attractive for our use case (it's not to distribute our end-user products).

Managing a debian repo is not too complicated, provided there is a single non-volatile machine running reprepro and syncing the state to cloud storage. Given #63 and #64, we could easily set this up.
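For reference, a reprepro setup on such a machine is mostly a conf/distributions entry (all values below are illustrative, not decided):

```
# conf/distributions (illustrative values)
Origin: radicle
Label: radicle
Codename: buster
Architectures: amd64
Components: main
Description: Monadic infrastructure tooling
```

Packages are then added with `reprepro includedeb buster <pkg>.deb`, and the repo tree can be mirrored to cloud storage with `gsutil rsync`.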

Distributed cache

At the moment the cache volume mounted at /cache is not shared between agents. This means that a build running on agent X cannot utilize the cache created by agent Y.

A possible solution would be to use Google Cloud Storage for the cache. This worked well for oscoin-hs; see https://github.com/oscoin/oscoin/blob/master/scripts/ci-cache.sh. The agent could just provide GCS credentials for privileged builds so these builds can use gsutil, or the agent could take care of actually uploading the cache.

podman: Manage volumes on ZFS

podman doesn't have pluggable volume drivers, so we need to create ZFS filesystems before invoking podman, and use bind mounts instead of volumes. The ZFS-specifics of zockervols should be reusable.

Depends on #54
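A sketch of the pre-create-and-bind-mount dance described above; the pool name and path conventions are assumptions:

```shell
# Map a volume name to a dataset under the CI pool.
dataset_for() {
  printf 'tank/ci-volumes/%s' "$1"
}

# podman bind-mount flag for a dataset mounted at its default path.
bind_flag() {
  printf -- '--mount type=bind,src=/%s,dst=%s' "$1" "$2"
}

# Before invoking podman:
#   zfs create -p "$(dataset_for build-cache)"
#   podman run $(bind_flag "$(dataset_for build-cache)" /cache) ...
```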

Opt-in shared master cache volumes

As we can't control concurrency from the agent, the cache volumes associated with default branches are scoped per agent instance. This should normally converge quickly, but builds with high cache churn (such as, apparently, radicle-registry) suffer from significant slowdowns.

Since concurrency can be controlled from pipeline.yaml, it is conceivable to offer an opt-in, "use at your own risk" environment variable which forces non-scoped cache volumes for "trusted" builds. If in the future we farm out the agents to multiple machines, the cache snapshots can also be obtained over the network.

Set up alerting for ffnet health

The registry team wants to be alerted when the ffnet or the underlying infrastructure becomes unhealthy. For this we need to decide the following:

  • Through which channel do we want to be alerted?
  • On which conditions do we want to be alerted? (To begin with it should be enough to choose a few critical ones.)
  • What tools/stack do we want to use?

RPC node deployment should allow fast sync with CPU burst

The RPC node for the ffnet is not stateful. This means that on re-deployment it needs to sync the whole chain. This process is CPU intensive. At the moment we have configured a limit of 0.2 CPUs for the RPC deployment to use. We should remove this limit so that the RPC node can use more CPU if available and allow faster syncing.

Automate config updates

The setup isn't designed to always be convergent, but it's safe to update the static configuration. There should be a script to do that.

Depends on #13

Pruning cache volumes does not work

The script to prune old cache volumes is not working. Cache volumes have accumulated on our machine and the systemd service prints errors.

Consider purging master cache volumes when they reach their quota

This is basically what registry is doing from within the build scripts. Under the assumption that caches grow indefinitely due to dependency churn, we could also have the agent manage this.

The only thing to consider is the dependency tree from master -> branch builds, which might prevent a recursive zfs destroy when child datasets are in use. Perhaps this should be a timer which runs at off-peak times.

Identify our nodes on telemetry.polkadot.io

At the moment the ffnet nodes that we run use randomly generated names to report to telemetry.polkadot.io. This makes it hard to distinguish them from nodes run by other people. I propose we use monadic-{random_number} as the name for our nodes.
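A sketch of generating such a name (uses bash's $RANDOM; how the name is passed depends on the node binary's flags):

```shell
# Proposed telemetry name: monadic-{random_number}.
node_name() {
  printf 'monadic-%s' "$((RANDOM % 10000))"
}

# e.g.: ./node --name "$(node_name)" ...
```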

Alert when blocks are mined too quickly

This should be monitored the same way as mining too slowly, except it should trigger only if is_major_synced is set. The margin of error should be smaller for averages over larger time windows.
