Issues from the ooni/devops repository

Optimize load balancer configuration

On AWS, each load balancer incurs an hourly charge for as long as it exists, regardless of how much traffic passes through it.

At the moment we are using load balancers and target groups to route traffic to individual microservices. For direct access, which would not be used in production, we should remove the load balancer in front of the services and instead communicate directly with the container.

This will likely involve doing some refactoring of the networking setup, since we need to configure direct connectivity to each host.

Move test helpers from digital ocean to AWS

The test helper rotation script is broken, and manual changes were made to DNS on 18th March 2024 to unbrick it: https://openobservatory.slack.com/archives/C38EJ0CET/p1710780947922739.

Following this incident the NS delegation of th.ooni.org has been migrated over to AWS, which currently hosts the following A and AAAA records:
0.th.ooni.org -> 146.190.119.3, 2604:a880:4:1d0::69e:f000
1.th.ooni.org -> 161.35.89.250, 2a03:b0c0:2:d0::1768:9001
2.th.ooni.org -> 161.35.89.250, 2a03:b0c0:2:d0::1768:9001
3.th.ooni.org -> 146.190.119.3, 2604:a880:4:1d0::69e:f000

Note that 1 and 2 point to the same IP, as do 0 and 3, because only two running VPSes had not been broken by the auto-rotation script.
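The duplication can be confirmed with a short script. This is a minimal sketch in Python: the records are copied from the list above, while the function and variable names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Records copied from the list above: name -> (IPv4, IPv6).
RECORDS = {
    "0.th.ooni.org": ("146.190.119.3", "2604:a880:4:1d0::69e:f000"),
    "1.th.ooni.org": ("161.35.89.250", "2a03:b0c0:2:d0::1768:9001"),
    "2.th.ooni.org": ("161.35.89.250", "2a03:b0c0:2:d0::1768:9001"),
    "3.th.ooni.org": ("146.190.119.3", "2604:a880:4:1d0::69e:f000"),
}


def group_by_ipv4(records):
    """Return a dict mapping each IPv4 address to the names that use it."""
    groups = defaultdict(list)
    for name, (ipv4, _ipv6) in sorted(records.items()):
        groups[ipv4].append(name)
    return dict(groups)


# Hypothetical name: one entry per distinct host behind the four names.
DISTINCT_HOSTS = group_by_ipv4(RECORDS)
```

With the records above, `DISTINCT_HOSTS` has two entries, one per surviving VPS.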

Plan for migration

We plan to migrate all these test helpers over to the AWS ECS based configuration, see: https://github.com/ooni/devops/blob/main/tf/environments/prod/main.tf#L505.

All the previous addresses will be configured to point to the ALB entry (see: https://github.com/ooni/devops/blob/main/tf/modules/oonith_service/main.tf#L176) for the oonith_service as alias records (an alias effectively behaves like a CNAME, but costs less: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-choosing-alias-non-alias.html).

Checklist

Implement API gateway for api.ooni.io

We should set up in terraform an API gateway that lets us route traffic to the new oonidatapi while keeping most of it pointed at the old legacy instance on backend-fsn.

This would allow us to centrally manage the paths and handle failing over to the old one if something breaks badly.
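The routing decision described here can be sketched as a small function. This is only an illustration in Python, not the gateway configuration itself: the path prefix, the `new_backend_healthy` flag, and the function name are all hypothetical, and only the two backend names come from the text above.

```python
LEGACY_BACKEND = "backend-fsn"  # legacy instance named in the text
NEW_BACKEND = "oonidatapi"      # new service named in the text

# Hypothetical: path prefixes that have been moved to the new service.
NEW_SERVICE_PREFIXES = ("/api/v2/example",)


def pick_backend(path, new_backend_healthy=True):
    """Choose a backend for a request path.

    Paths under a migrated prefix go to the new service, unless it is
    marked unhealthy, in which case they fail over to the legacy one.
    Everything else stays on the legacy backend.
    """
    for prefix in NEW_SERVICE_PREFIXES:
        if path == prefix or path.startswith(prefix + "/"):
            return NEW_BACKEND if new_backend_healthy else LEGACY_BACKEND
    return LEGACY_BACKEND
```

Keeping the list of migrated prefixes in one place is what makes the paths centrally manageable: moving an endpoint, or rolling it back, is a one-line change.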

I will drop in here some links to stuff to look at:

Port oohelperd deployment over to new pattern

This is about doing the following:

Creating dockerfile for oohelperd, see: ooni/backend#827
Creating codebuild/codepipeline buildspec to build and deploy it, see: ooni/backend#827
Adding relevant modules to ooni/devops to deploy oohelperd to a new ECS cluster following the template of how we did it for ooniapi-services. We should copy those modules and customize them to deploy this golang image.

For the last point we should follow the template of:

ooniapi_service
ooniapi_service_deployer
And make use of the ecs_cluster modules to create:
oonith_service
oonith_service_deployer

These should create an ECS task definition and build the dockerfile mentioned above, in order to deploy the oohelperd test helper to a new ecs_cluster.

Investigate how autoscaling works in our ECS deployments

We would like to have the autoscaling groups adjust the number of service instances and EC2 instances to account for increased load or usage.

We should investigate how we have it set up at the moment and evaluate whether some tuning of the autoscaling group is needed for it to adjust dynamically.

See:

Fix terraform GH workflow

We want to fix the check_terraform.yml workflow.

Move Code Build and Code Pipeline setup to terraform

As part of ooni/backend#796 I wrote the Code Build and Code Pipeline workflows directly from the AWS UI (it was easier to understand what was going on and how they worked from the web interface).

We should at some point move that into the terraform configuration so that it's reproducibly deployed and can be easily extended to other projects without needing to click through UIs.

Come up with list of tiers

Here is a tentative list:

Tier 0 (Critical) components

Probe Services (collector specifically)
Fastpath (part responsible for storing post-cans)
DNS configuration
Monitoring
OONI bridges!
OONI.org website

Tier 1 (Essential) components

OONI API measurement listing
OONI Explorer
OONI Run
OONI Data analysis pipeline
Website analytics

Tier 2 (Non-Essential) components

Test list editor
Jupyter notebooks
Countly

Switch to using a secrets manager for storing credentials

As we move forward with this we are going to be having more and more credentials which we will have to use as part of the CI.

At the moment we are storing credentials inside of ooni/private, however this doesn't lend itself nicely to CD/CI since it's tied to our personal gpg keys.

We should instead move all the credentials over to a secrets manager and integrate that into the CD/CI process in here.

This might be bitwarden secrets manager or even just AWS secrets manager.

Come up with solution to prevent redeploying too often

Problem:

At the moment whenever something lands in master this triggers a rebuild and deployment of every configured service, including those which are not affected by any changes to the code.

For example, if I make changes only to ooniauth, oonirun is redeployed too. This should not have any real impact and is not a big problem at the moment since we don't have too many services, however in the future this is problematic for 2 reasons:

1. A change to an unrelated service might trigger the deployment of a broken unrelated service
2. Even if 1. is mitigated, we are still building a new package and redeploying it for no reason (if our builds were reproducible it would be exactly the same package), costing us CPU cycles.

Possible solutions:

Add filtering to the codebuild/codepipeline tasks so that a deploy is only triggered when changes affect a specific tree (similar to how you would do this with GitHub Actions: https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-including-and-excluding-paths)
Split up each service in their own sub-repo
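The first option, path-based filtering, amounts to mapping changed files to the services that own them. A minimal sketch in Python follows; the directory layout, the shared tree, and the function names are hypothetical and would need to match the actual repository structure.

```python
from pathlib import PurePosixPath

# Hypothetical layout: each service lives in its own tree.
SERVICE_TREES = {
    "ooniauth": "ooniapi/services/ooniauth",
    "oonirun": "ooniapi/services/oonirun",
}

# Hypothetical shared tree: a change here affects every service.
SHARED_TREES = ("ooniapi/common",)


def _is_under(path, tree):
    """True if path is the tree itself or lies somewhere inside it."""
    p, t = PurePosixPath(path), PurePosixPath(tree)
    return p == t or t in p.parents


def services_to_deploy(changed_files):
    """Return the sorted list of services affected by the changed files."""
    if any(_is_under(f, t) for f in changed_files for t in SHARED_TREES):
        return sorted(SERVICE_TREES)
    return sorted(
        name
        for name, tree in SERVICE_TREES.items()
        if any(_is_under(f, tree) for f in changed_files)
    )
```

Comparing path components rather than raw string prefixes avoids matching a sibling directory whose name merely starts with a service name.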

Cleanup CloudHSM setup

Checklist:

Cleanup old cluster after verifying that the keys are there (new cluster: cluster-qsvghm4oqok)
Cleanup old keys in the new cluster
Ensure the networking setup is working properly
Set up a system for setting up and using the keys
Fix script for initializing the codesign box

Optional

Make the CloudHSM airgapped and document workflow for using it

We recently did a ceremony to enable our cloudhsm cluster.

During the ceremony we ran into an issue where we mistakenly created additional keys that ended up not getting used.

We should do a cleanup of the keys that are in the admin role, but only after having verified that they are not needed.

Moreover, the original cluster on which we did the key generation was not highly available. Since you cannot change the availability zones after a cluster is initialized, I had to duplicate the existing cluster and restore it from backup.

We should validate that the restore from backup worked, that all the keys are in there and then delete the old cluster.

We are also charged per hour that the HSM keys are active, so we should come up with a nicer setup to spin them up and tear them down on demand instead of being charged continuously.

Last month we already got an unusually high bill due to this.

I have verified that the keys are preserved as long as we keep the cluster running, and that you can just spin up new HSM modules when you need them.

Create snapshot of route53 and add to the terraform state

The records currently on route53 were created manually. We should create a snapshot of them and import them into the terraform state, so that they are tracked as part of our infrastructure as code and we can make changes to them in terraform.

I found this guide that explains how this can be done: https://www.garygitton.fr/how-to-import-aws-route53-zone-in-terraform/
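One way to approach the import is to generate one `terraform import` command per existing record. The sketch below, in Python, assumes the import ID for `aws_route53_record` is the zone ID, record name, and record type joined by underscores; this should be checked against the AWS provider documentation before use. The zone ID, the `rec_` resource-name prefix, and the function names are hypothetical.

```python
import re


def resource_name(record_name, record_type):
    """Build a terraform-safe resource name (hypothetical naming scheme)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", record_name.rstrip(".")).strip("_")
    # Prefix with letters so the name never starts with a digit.
    return f"rec_{slug}_{record_type}"


def import_command(zone_id, record_name, record_type):
    """Return the terraform import command for one existing record.

    Assumes the import ID format ZONEID_NAME_TYPE; verify against the
    provider documentation.
    """
    name = record_name.rstrip(".")
    address = f"aws_route53_record.{resource_name(name, record_type)}"
    return f"terraform import {address} {zone_id}_{name}_{record_type}"
```

For instance, `import_command("Z0000000EXAMPLE", "0.th.ooni.org.", "A")` yields a command that imports the record into `aws_route53_record.rec_0_th_ooni_org_A`. A matching resource block still has to exist in the terraform configuration for each imported record.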

Start outlining the test environment

We should start outlining the structure of the test tier, so we can start migrating services over starting from the testing environment.

As part of this we should be refactoring the current resources and terraform definitions into separate modules (following how this has been done for example for clickhouse: #4)

Implement consolidation of apidocs for OONI Services

Service docs are available at the direct endpoint address like this:

We should implement a service that merges together all the openapi specs into a single one that can be viewed by accessing:

https://api.dev.ooni.io/docs/

Bug: `vpc:*` permissions in OONIDevopsPolicy

This was reported here: #14 (comment). As it turns out, vpc is not a valid AWS service prefix, so vpc:* is not a valid action. We want to replace it with the valid vpc-lattice:*.
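A mistake of this kind can be caught before deployment by checking each action's service prefix against a list of known prefixes. The following Python sketch is illustrative only: `KNOWN_SERVICE_PREFIXES` is a small hand-written subset rather than the real list of IAM service prefixes, and the function name is hypothetical.

```python
# Illustrative subset only; the real list of IAM service prefixes is much longer.
KNOWN_SERVICE_PREFIXES = {"ec2", "ecs", "s3", "dynamodb", "route53", "vpc-lattice"}


def invalid_actions(policy):
    """Return the actions in a policy document whose service prefix is unknown."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    bad = []
    for statement in statements:
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be a bare string
            actions = [actions]
        for action in actions:
            if action == "*":
                continue  # full wildcard has no service prefix to check
            prefix = action.split(":", 1)[0].lower()
            if prefix not in KNOWN_SERVICE_PREFIXES:
                bad.append(action)
    return bad
```

Run against a policy that contains `vpc:*` alongside `ec2:*`, the function reports only `vpc:*`.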

Tighten up oonidevops_github policy

The oonidevops-github IAM user cannot run terraform plan due to missing DynamoDB permissions. We should extend the ooni-devops-github permissions policy to give the user sufficient access to run terraform actions in the github CI.

Observed here: https://github.com/ooni/devops/actions/runs/8281278038/job/22659415894#step:8:13

Consolidate ansible configuration management into ooni/devops

We should do a big cleanup of the ooni/sysadmin repo and consolidate everything that is still relevant into ooni/devops and discontinue it.

We should make sure nothing breaks in the process.

See future of ooni infrastructure design document.

Create github user for terraform runs

We want to create an IAM user which can be used by GH actions to run terraform (for plan and apply) in the dev environment. This user should have the minimal sufficient permissions to manage AWS resources for our dev infra.

Part of: #6

Point api.dev backend-proxy frontend to backend-test.ooni.org

Currently it's pointing to backend-fsn.ooni.org, it should instead point to backend-test: https://github.com/ooni/devops/blob/main/tf/modules/ooni_backendproxy/templates/setup-backend-proxy.sh#L15

Come up with custom health checks for target groups

Currently, AWS uses a default configuration to run a health check on the / path of a service and expects a 200 status code. We should extend this to use the /healthcheck path (in the services that provide it) and also improve the health check configuration in general.
