radicle-dev / infra

Infrastructure
Follow-up of #15
The nice thing about backing docker volumes by a CoW filesystem is that we could incrementally store snapshots on shared storage (e.g. Google Cloud Storage), which would help in farming out build agents. The tricky part is to ensure ancestry for concurrent builds when the storage end is not a ZFS pool:

Let M be the master cache. Let B_1 and B_2 be concurrent builds of the same branch, each inheriting M. Their cache states result in C_1 and C_2, respectively. When B_3 is triggered, it needs to choose one of C_1, C_2, or M. The problem is to find the optimal choice with respect to cache invalidation.
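One conceivable heuristic, sketched below, is to pick whichever candidate carries the most recent snapshot; the dataset names are made up, and this ignores invalidation depth entirely:

# Pick the candidate cache dataset with the newest snapshot (epoch seconds).
zfs list -H -p -t snapshot -o name,creation \
    -r tank/cache/master tank/cache/c1 tank/cache/c2 \
  | sort -t$'\t' -k2 -nr \
  | head -n1 \
  | cut -f1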
For easy deployment and rollbacks we should provide a debian package for the CI code.
We still have the configuration of some domains as terraform files in the private monadic-xyz/infrastructure repo. We should move that here.
The command hook needs to pass variables explicitly to docker run. There's no way to distinguish which variables (of env) were set in the pipeline, so perhaps require some prefix?
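A minimal sketch of such prefix filtering, assuming we settle on a CI_ prefix (both the prefix and the hook wiring are assumptions):

# Forward only pipeline variables carrying the agreed-upon prefix.
env_args=()
while IFS='=' read -r name _; do
  case "$name" in
    CI_*) env_args+=(-e "$name") ;;  # docker picks up the value from our env
  esac
done < <(env)
docker run "${env_args[@]}" "$IMAGE" "$@"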
The distro version of the buildkite agent instances we run on GCP (Ubuntu 20.04) does not match some of the package sources we are using in cloud-init.yaml. (These include debian buster and ubuntu bionic.)
All releases should be cut from the #63 box, and appropriately signed.
signify is a good tool to avoid GPG and the associated key management headaches, but GPG may still be needed for some use cases (e.g. deb packages).
Access to storage platforms for the produced artifacts also needs to be provided for the #63 box, with appropriate documentation and credential rotation procedures. This is part of this issue.
Depends on #63
The stateful ffnet validators should store their chain data on the persistent volume so that they don’t need to resync when they are restarted. The persistent volume is already mounted, we just need to pass it with the --data flag to the node executable.
To make it easier for developers and users to identify and reference artifacts (both binaries and docker images), I propose we use tag names to identify them when tags are built.

If a tag foo is built, the docker image should be tagged ${STEP_DOCKER_IMAGE}:${BUILDKITE_TAG}. Similarly, artifacts should be published to gs://builds.radicle.xyz/${BUILDKITE_PIPELINE_SLUG}/tags/${BUILDKITE_TAG}.
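A rough sketch of what that build step could look like, assuming the build leaves a local image named ${STEP_DOCKER_IMAGE} and the binaries in an artifacts/ directory (both assumptions):

if [ -n "${BUILDKITE_TAG}" ]; then
  # Re-tag and push the image under the git tag name.
  docker tag "${STEP_DOCKER_IMAGE}" "${STEP_DOCKER_IMAGE}:${BUILDKITE_TAG}"
  docker push "${STEP_DOCKER_IMAGE}:${BUILDKITE_TAG}"
  # Publish the binaries next to it in the bucket.
  gsutil cp -r artifacts/ \
    "gs://builds.radicle.xyz/${BUILDKITE_PIPELINE_SLUG}/tags/${BUILDKITE_TAG}/"
fi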
We need a historical view on the data from telemetry.polkadot.io. Possible solutions:
Useful for "master is always green" and caching policies (cf. #9).
Need to find out which of the bors variants is the least hassle to operate and exposes the smallest attack surface.
The terraform code for the registry development infrastructure lives in a private repo at the moment. It should be moved here. We should carefully review the repo to make sure no sensitive data is exposed.
We don't add newly provisioned agents to the production pool by default. When provisioning from a known-good version, it should be possible to set instance metadata which causes the agent tags to be set such that the agent joins the production pool.
Depends on #54
When a job is cancelled, the container that is running the job is not stopped. This consumes resources on the machine and blocks the cache volume from being used by the next job for this branch. The latter results in a docker: Error response from daemon: File exists (os error 17) error.
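A possible mitigation, sketched here under the assumption that our hooks label containers with the job id (they currently may not):

# pre-exit hook: Buildkite runs this even when the job is cancelled.
containers=$(docker ps -q --filter "label=job=${BUILDKITE_JOB_ID}")
if [ -n "$containers" ]; then
  docker stop $containers
fi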
To run and test the latest registry nodes we are currently updating the node version on the devnet for every master build of the registry. This is implemented through a build step that sends an update to the Kubernetes cluster with kubectl. There are two issues with this approach.

To address these issues I propose we move the deployment logic to the devnet cluster. Concretely, we run a Kubernetes service or job that updates the image when a new master build is available. I'm not aware of a simple tool that allows us to do that, so I'd propose implementing a simple Kubernetes job that pulls the git repository and updates the images when a new commit is available.
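A minimal sketch of such an updater loop; the deployment, container, and image names are made up:

# Poll master and roll the deployment to the image built from its tip.
while true; do
  sha=$(git ls-remote https://github.com/radicle-dev/radicle-registry.git \
          refs/heads/master | cut -f1)
  kubectl set image deployment/devnet-validator \
    node="gcr.io/radicle-registry/node:${sha}"
  sleep 60
done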
Turns out ZFS is picky about which characters are allowed in volume names (the allowed set is narrower than for a valid POSIX path), so it's better to just use a hash of the supplied name. We need to figure out whether that confuses docker.
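Something along these lines, with a made-up pool name:

# Derive a ZFS-safe dataset name from the arbitrary volume name.
hashed=$(printf '%s' "$volume_name" | sha256sum | cut -c1-32)
zfs create "tank/volumes/${hashed}"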
We’re not removing images built by a step when STEP_DOCKER_FILE is set. We should extend the docker-volume-prune service to also remove these images.
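One way to do this, assuming we start labelling the step images at build time (the label name is made up):

# At build time:
docker build --label step_image=true -f "$STEP_DOCKER_FILE" .
# In the prune service, remove all unused images carrying that label:
docker image prune --all --force --filter "label=step_image=true"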
The minimum should be the number of validator peers in the cluster. We can't rely on miner nodes being up.
Apart from better "rootless" support, this would also enable running pods.
It seems like the kata-containers runtime would also be supported, but we couldn't find anything in the official docs.
/cc @xla @geigerzaehler
To make it easier to use the devnet with upstream and the registry CLI, we should give the RPC node for the devnet a DNS entry. We are already doing this for the ffnet.
Cloud agents do this thanks to Google provisioning on the official VM images. We may want to use the same on bare metal, or, better yet, do something more secure, e.g. CA-signed keys.
As a developer using the CI I want to add secrets to my codebase so that a job can take a restricted action (e.g. deploy an update).
To accomplish this we can use sops. A developer can provide a sops file containing encrypted secrets in .buildkite/secrets.yaml. In the first stage the file is encrypted using a PGP key whose private key resides on the agent. On master builds the secrets are made available to the job using
sops --output-type dotenv -d /build/.buildkite/secrets.yaml
To improve the security we can use GCP KMS to store the private key to decrypt the secrets.
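For illustration, the job environment could then be populated like this (a sketch; it assumes simple KEY=value secrets without shell metacharacters):

# Decrypt to dotenv format and export each entry into the job environment.
set -a
. <(sops --output-type dotenv -d /build/.buildkite/secrets.yaml)
set +a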
The official rust docker images install cargo in /usr/local, with no registry cache. Thus, every build spends a lot of time in "Updating crates.io index".

We could pre-populate it, e.g. when installing sccache, but we would then have to make /usr/local/cargo/registry world-writable, as CI containers are not allowed to run as root. Standardising on a proper user account inside the container is a non-solution, as it would have to match a host user for access to bind mounts. Potentially, the registry could be cached on the host, and bind mounted into the guest.
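The host-side variant could look roughly like this (the host path is made up; the official images set CARGO_HOME=/usr/local/cargo):

# Share the crates.io index and crate sources from the host.
docker run \
  -v /var/cache/cargo-registry:/usr/local/cargo/registry \
  -v "$PWD":/src -w /src \
  rust:latest cargo build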
The nodes are being restarted automatically, so downtimes are the norm. We need to switch alerting from one based on Prometheus' up metric to one based on Stackdriver's metrics, which can tell a crash from a restart.
Notes from @geigerzaehler:
One issue we have with the current setup is that we get false alerts for nodes being down when miners are pre-empted. To check miner availability we need to distinguish between a node being down because its pod is rescheduled due to a VM being pre-empted, and a node being down because it has crashed.

I found that kube-state-metrics exposes a metric for the pod phase. We could use the "Failed" status of that metric to determine whether a node crashed.
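If we go that route, the alert condition could be as simple as the following PromQL expression (untested):

kube_pod_status_phase{phase="Failed"} > 0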
For security reasons, the following may be considered:
Currently, new branch builds use a cache volume whose origin is the agent-scoped master cache volume instead of the shared master cache volume, even if SHARED_MASTER_CACHE is set to true. A workaround is to set SHARED_MASTER_CACHE=true for the first buildkite pipeline upload step.

The reason for this behavior is that the cache volumes are created only once, but in whichever build step runs first, even the special buildkite pipeline upload step. To address this properly we should move the volume creation to a later stage, where it is skipped when we do the pipeline upload.
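As a sketch, the hook that creates the volumes could bail out early on the upload step, relying on BUILDKITE_COMMAND (which the agent exports):

# Skip cache volume creation for the pipeline upload step.
case "${BUILDKITE_COMMAND}" in
  *"buildkite-agent pipeline upload"*) exit 0 ;;
esac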
To speed up builds we would like to cache CARGO_TARGET_DIR (which is usually ./target) between builds of radicle-registry. Caching the target directory gives a significant speed-up over just using sccache, since only the project code is recompiled. (I’ll try to provide hard numbers later.)

The target directory contains roughly 4GB for radicle-registry. So for sensible caching we’ll need roughly 8GB of space on the /cache volume. Inside the project we’ll also need to think of a strategy to clean up the target directory.
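The wiring itself is small; a sketch, assuming /cache is the mounted cache volume:

# Point cargo's build output at the per-branch cache volume.
export CARGO_TARGET_DIR=/cache/target
cargo build --all-targets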
We have Cloudflare as the nameserver of radicle.run, where the DNS records are managed. We should add the Cloudflare configuration here.
This allows monitoring for DDoS attacks and for genuinely invalid actors. Depends on paritytech/substrate#6006
deb packages make provisioning our own infrastructure tools much more convenient.

However, using standalone deb files can't take advantage of apt's dependency resolution, which makes provisioning code more brittle. We've been using bintray for this, but it's a bit arcane, and frequently blocks our account for storage quota breaches. packagecloud is an alternative, but the pricing of both doesn't seem very attractive for our use case (this is not about distributing our end-user products).
Managing a debian repo is not too complicated, provided there is a single non-volatile machine running reprepro and syncing the state to cloud storage. Given #63 and #64, we could easily set this up.
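A sketch of the publish path, assuming a conventional reprepro layout under /srv/debian and a mirror bucket (all names made up):

# Add a package to the repo and mirror the result to cloud storage.
reprepro -b /srv/debian includedeb stable ci-tool_1.0_amd64.deb
gsutil -m rsync -r /srv/debian gs://apt.radicle.xyz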
Sometimes build cache volumes are created without the build_cache label. This results in those volumes not being cleaned up by docker-volume-prune.sh, and thus we run out of disk space on the machine.
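Wherever the volumes get created, the label should be passed unconditionally; a sketch, assuming the hook goes through docker volume create with the zockervols driver:

docker volume create --driver zockervols --label build_cache "$CACHE_VOLUME_NAME"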
Ze bash, it makes me cringe.
This allows monitoring for DDoS attacks and for genuinely invalid actors. Depends on radicle-dev/radicle-registry#463
At the moment the cache volume mounted at /cache is not shared between agents. This means that a build running on agent X cannot utilize the cache created by agent Y.

A possible solution would be to use Google Cloud Storage for the cache. This worked well for oscoin-hs, see https://github.com/oscoin/oscoin/blob/master/scripts/ci-cache.sh. The agent could just provide GCS credentials for privileged builds so these builds can use gsutil, or the agent would take care of actually uploading the cache.
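Mirroring the oscoin approach, the transfer could be as plain as this (bucket and key names are made up):

# Restore the cache at job start; upload it again at job end.
gsutil cp "gs://radicle-ci-cache/${BUILDKITE_PIPELINE_SLUG}.tar.gz" - \
  | tar -xz -C /cache || true
tar -cz -C /cache . \
  | gsutil cp - "gs://radicle-ci-cache/${BUILDKITE_PIPELINE_SLUG}.tar.gz"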
The nodes get a negative chain-length change after being restarted. A negative change shouldn't be a reason to raise the alarm.
We mount the token into the build containers of "trusted" builds, but it's a bit unclear what attack surface this opens. We should at least rotate the tokens from time to time, and maybe ask Buildkite to give us better scoping.
podman doesn't have pluggable volume drivers, so we need to create ZFS filesystems before invoking podman, and use bind mounts instead of volumes. The ZFS-specifics of zockervols should be reusable.
Depends on #54
As we can't control concurrency from the agent, the cache volumes associated with default branches are scoped per agent instance. This should normally converge quickly, but builds with high cache churn (as is apparently the case for radicle-registry) suffer from significant slowdowns.

Since concurrency can be controlled from pipeline.yaml, it is conceivable to offer an opt-in, "use at your own risk" environment variable which forces non-scoped cache volumes for "trusted" builds. If in the future we farm out the agents to multiple machines, the cache snapshots can also be obtained over the network.
The registry team wants to be alerted when the FFnet or the underlying infrastructure becomes unhealthy. For this we need to decide the following:
This allows monitoring for DDoS attacks and for genuinely invalid actors. Depends on radicle-dev/radicle-registry#407
The RPC node for the ffnet is not stateful. This means that on re-deployment it needs to sync the whole chain. This process is CPU intensive. At the moment we have configured a limit of 0.2 CPUs for the RPC deployment to use. We should remove this limit so that the RPC node can use more CPU if available and allow faster syncing.
It ain't designed to always be convergent, but it's safe to update the static configuration. There should be a script to do that.
Depends on #13
The script to prune old cache volumes is not working. Cache volumes have accumulated on our machine and the systemd service prints errors.
This is basically what registry is doing from within the build scripts. Under the assumption that caches grow indefinitely due to dependency churn, we could also have the agent manage this.
The only thing to consider is the dependency tree from master -> branch builds, which might prevent a recursive zfs destroy when child datasets are in use. Perhaps this should be a timer which runs at non-peak times.
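The timer's payload could be as simple as this sketch (the dataset layout is an assumption; zfs destroy fails, and we skip, while a child is busy):

# Off-peak cache cleanup: destroy branch datasets recursively where possible.
for ds in $(zfs list -H -o name -d 1 tank/cache | tail -n +2); do
  zfs destroy -r "$ds" || echo "in use, skipping: $ds"
done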
Sometimes nodes get stuck while syncing when started for the first time. We should create an alert for that case.
This should be an airgapped box physically in the HQ. All artifact-producing build steps should run there.
At the moment the ffnet nodes that we run use randomly generated names to report to telemetry.polkadot.io. This makes it hard to distinguish them from nodes run by other people. I propose we use monadic-{random_number} as the name for our nodes.
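Assuming the node binary follows substrate's CLI conventions and accepts a --name flag (an assumption, as is the binary name), this is a one-line change at start-up:

radicle-registry-node --name "monadic-${RANDOM}"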
This should be monitored the same as mining too slow, except it should trigger only if is_major_synced is set. The margin of error should be smaller for averages over larger time windows.