devops's Issues
Optimize load balancer configuration
On AWS you pay for load balancers by the hour, irrespective of how much bandwidth goes over them.
At the moment we are using load balancers and target groups to route traffic to individual microservices. For direct access to the microservices, which would not be used in production, we should remove the load balancer in front of them and instead communicate directly with the container.
This will likely involve some refactoring of the networking setup, since we need to configure direct connectivity to each host.
Move test helpers from digital ocean to AWS
Test helper rotation script is broken and manual changes were made to DNS to unbrick it on 18th March 2024: https://openobservatory.slack.com/archives/C38EJ0CET/p1710780947922739.
Following this incident the NS delegation of th.ooni.org has been migrated over to AWS, which currently hosts the following A and AAAA records:
0.th.ooni.org -> 146.190.119.3, 2604:a880:4:1d0::69e:f000
1.th.ooni.org -> 161.35.89.250, 2a03:b0c0:2:d0::1768:9001
2.th.ooni.org -> 161.35.89.250, 2a03:b0c0:2:d0::1768:9001
3.th.ooni.org -> 146.190.119.3, 2604:a880:4:1d0::69e:f000
Note that 1 and 2 point to the same IP, as do 0 and 3, because only 2 VPSes were still running and had not been broken by the auto rotation script.
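To make the overlap explicit, here is a minimal sketch that groups the records listed above by the address pair they point to. The record data is copied from this issue; the function name `group_by_host` is only illustrative.

```python
from collections import defaultdict

# Records as listed above: hostname -> (IPv4, IPv6)
RECORDS = {
    "0.th.ooni.org": ("146.190.119.3", "2604:a880:4:1d0::69e:f000"),
    "1.th.ooni.org": ("161.35.89.250", "2a03:b0c0:2:d0::1768:9001"),
    "2.th.ooni.org": ("161.35.89.250", "2a03:b0c0:2:d0::1768:9001"),
    "3.th.ooni.org": ("146.190.119.3", "2604:a880:4:1d0::69e:f000"),
}


def group_by_host(records):
    """Group hostnames by the (IPv4, IPv6) pair they resolve to."""
    groups = defaultdict(list)
    for name, addrs in records.items():
        groups[addrs].append(name)
    return {addrs: sorted(names) for addrs, names in groups.items()}


backends = group_by_host(RECORDS)
for addrs, names in backends.items():
    print(addrs[0], "<-", ", ".join(names))
```

Four hostnames collapse onto two distinct hosts, which is the capacity we actually have until the migration is done.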
Plan for migration
We plan to migrate all these test helpers over to the AWS ECS based configuration, see: https://github.com/ooni/devops/blob/main/tf/environments/prod/main.tf#L505.
All the previous addresses will be configured to point to the ALB entry (see: https://github.com/ooni/devops/blob/main/tf/modules/oonith_service/main.tf#L176) for the oonith_service as aliases (effectively an alias behaves like a CNAME, but costs less: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-choosing-alias-non-alias.html).
Checklist
- Add support for IPv6 connectivity on test helpers
- Setup 4.th.ooni.org on AWS (done 19th April 2024)
- Update check-in to return 4.th.ooni.org (done 22nd April 2024)
- Drop test helper migration script from backend-fsn (done 22nd April 2024)
- Drop 3.th.ooni.org from prod/dns_records.tf and have it point to aws_alb.oonith_service: #48
- Monitor failure rate for 3.th.ooni.org
- Monitor load of test helper to see if capacity is enough
- Bump up capacity of machine and ensure that it’s increased with zero downtime
- Edit backend config to return only 1,2,3,4.th.ooni.org (done 10:45 23rd April 2024 CEST): ooni/backend#838
- Drop 0.th.ooni.org from prod/dns_records.tf and have it point to aws_alb.oonith_service
- Monitor failure rate for 0.th.ooni.org
- Edit backend config to return only 0,3,4.th.ooni.org: ooni/backend#840
- Drop 1-2.th.ooni.org from prod/dns_records.tf and have it point to aws_alb.oonith_service: #52
- Monitor failure rate for 1-2.th.ooni.org
- Delete all test helper related hosts from digital ocean
Implement API gateway for api.ooni.io
We should set up an API gateway in terraform that lets us route traffic to the new oonidatapi while keeping most paths pointing to the old legacy instance on backend-fsn.
This would allow us to centrally manage the paths and fail over to the old instance if something breaks badly.
I will drop in here some links to stuff to look at:
Port oohelperd deployment over to new pattern
This is about doing the following:
- Creating dockerfile for oohelperd, see: ooni/backend#827
- Creating codebuild/codepipeline buildspec to build and deploy it, see: ooni/backend#827
- Adding relevant modules to ooni/devops to deploy oohelperd to a new ECS cluster, following the template of how we did it for ooniapi-services. We should copy those modules and customize them to deploy this golang image.
For the last point we should follow the template of:
- ooniapi_service
- ooniapi_service_deployer
And make use of the ecs_cluster modules to create:
- oonith_service
- oonith_service_deployer
These create an ECS task definition and build the dockerfile I wrote above to deploy the oohelperd test helper to a new ecs_cluster.
Investigate how autoscaling works in our ECS deployments
We would like the autoscaling groups to adjust the number of service instances and EC2 instances to account for increased load or usage.
We should investigate how it is set up at the moment and evaluate whether the autoscaling group needs tuning in order to adjust dynamically.
See:
Fix terraform GH workflow
We want to fix the `check_terraform.yml` workflow.
Move Code Build and Code Pipeline setup to terraform
As part of ooni/backend#796 I set up the Code Build and Code Pipeline workflows directly in the AWS UI (it was easier to understand what was going on and how they worked from the web interface).
We should at some point move that into the terraform configuration so that it's reproducibly deployed and can easily be extended to other projects without needing to click through UIs.
Come up with list of tiers
Here is a tentative list:
Tier 0 (Critical) components
- Probe Services (collector specifically)
- Fastpath (part responsible for storing post-cans)
- DNS configuration
- Monitoring
- OONI bridges!
- OONI.org website
Tier 1 (Essential) components
- OONI API measurement listing
- OONI Explorer
- OONI Run
- OONI Data analysis pipeline
- Website analytics
Tier 2 (Non-Essential) components
- Test list editor
- Jupyter notebooks
- Countly
Switch to using a secrets manager for storing credentials
As we move forward, we are going to have more and more credentials that we need to use as part of the CI.
At the moment we store credentials inside of ooni/private, but this doesn't lend itself nicely to CI/CD since it's tied to our personal gpg keys.
We should instead move all the credentials over to a secrets manager and integrate that into the CI/CD process in here.
This might be Bitwarden Secrets Manager or even just AWS Secrets Manager.
Come up with solution to prevent redeploying too often
Problem:
At the moment, whenever something lands in master it triggers a rebuild and deployment of every configured service, including those not affected by the code changes.
For example, if I make changes only to `ooniauth`, `oonirun` is redeployed too. This does not have any real impact and is not a big problem at the moment since we don't have many services, but in the future it is problematic for 2 reasons:
- A change to one service might trigger the deployment of a broken unrelated service
- Even if the first problem is mitigated, we are still building a new package and redeploying it for no reason (if our builds were reproducible it would be exactly the same package), costing us CPU cycles.
Possible solutions:
- Add filtering to the codebuild/codepipeline tasks so that a deploy is only triggered when changes affect a specific tree (similar to how you would do this with GitHub Actions: https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-including-and-excluding-paths)
- Split up each service in their own sub-repo
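The first option boils down to mapping changed files onto service trees. Below is a minimal sketch of that decision, assuming a hypothetical layout where each service lives under `ooniapi/services/<name>`; the paths and the `services_to_deploy` helper are illustrative and not taken from the actual codebuild configuration.

```python
from pathlib import PurePosixPath

# Hypothetical layout: each service lives in its own directory tree.
SERVICE_ROOTS = {
    "ooniauth": "ooniapi/services/ooniauth",
    "oonirun": "ooniapi/services/oonirun",
}


def services_to_deploy(changed_files, service_roots=SERVICE_ROOTS):
    """Return the services whose tree contains at least one changed file."""
    affected = set()
    for changed in changed_files:
        parents = PurePosixPath(changed).parents
        for service, root in service_roots.items():
            # A file belongs to a service if the service root is one of its ancestors.
            if PurePosixPath(root) in parents:
                affected.add(service)
    return sorted(affected)


print(services_to_deploy(["ooniapi/services/ooniauth/src/main.py", "README.md"]))
```

With a filter like this, a commit touching only `ooniauth` would leave `oonirun` alone, and a commit touching only shared files would deploy nothing unless those files are added to the relevant trees.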
Cleanup CloudHSM setup
Checklist:
- Cleanup old cluster after verifying that the keys are there (new cluster: `cluster-qsvghm4oqok`)
- Cleanup old keys in the new cluster
- Ensure the networking setup is working properly
- Setup system for setting up and using the keys
- Fix script for initializing the codesign box
Optional
- Make the CloudHSM airgapped and document workflow for using it
We recently did a ceremony to enable our cloudhsm cluster.
During the ceremony we ran into an issue where we mistakenly created additional keys that ended up not getting used.
We should clean up the keys that are in the admin role, but only after having verified that they are not needed.
Moreover, the original cluster on which we did the key generation was not highly available. Since you cannot change the availability zones after a cluster is initialized, I had to duplicate the existing cluster and restore it from backup.
We should validate that the restore from backup worked and that all the keys are in there, and then delete the old cluster.
We are also charged per hour of activity of the keys, so we should come up with a nicer setup to spin up and tear down the HSM keys instead of getting charged per hour.
Last month we already got an unusually high bill due to this.
I have verified that the keys are preserved as long as we keep the cluster running, and you can just spin up new HSM modules when you need them.
Create snapshot of route53 and add to the terraform state
The records currently on route53 were created manually. We should create a snapshot of them and import them into the terraform state so that they are tracked as part of our infrastructure as code and we can make changes to them in terraform.
I found this guide that explains how this can be done: https://www.garygitton.fr/how-to-import-aws-route53-zone-in-terraform/
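As a rough sketch of the import step, the helper below builds `terraform import` commands from a list of records. It relies on the `aws_route53_record` import ID format (zone ID, record name and record type joined by underscores). The zone ID `Z0000000EXAMPLE` and the terraform resource names are placeholders, not values from our account.

```python
def import_command(zone_id, name, record_type, resource_name):
    """Build a `terraform import` command for one Route 53 record.

    The aws_route53_record import ID has the form ZONEID_NAME_TYPE.
    """
    import_id = f"{zone_id}_{name}_{record_type}"
    return f"terraform import aws_route53_record.{resource_name} {import_id}"


# Placeholder zone ID and hypothetical resource names, for illustration only.
ZONE_ID = "Z0000000EXAMPLE"
snapshot = [
    ("0.th.ooni.org", "A", "th_0"),
    ("0.th.ooni.org", "AAAA", "th_0_v6"),
]

commands = [import_command(ZONE_ID, name, rtype, res) for name, rtype, res in snapshot]
for command in commands:
    print(command)
```

Each imported record still needs a matching `aws_route53_record` resource block in the terraform configuration before the import is run.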
Start outlining the test environment
We should start outlining the structure of the test tier, so we can start migrating services over starting from the testing environment.
As part of this we should be refactoring the current resources and terraform definitions into separate modules (following how this has been done for example for clickhouse: #4)
Implement consolidation of apidocs for OONI Services
Service docs are available at the direct endpoint address like this:
We should implement a service that merges together all the openapi specs into a single one that can be viewed by accessing:
Bug: `vpc:*` permissions in OONIDevopsPolicy
This was reported here: #14 (comment). As it turns out, `vpc:*` is not a valid AWS service. We want to replace this with the valid `vpc-lattice:*`.
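The change itself is a one-for-one swap of the action string. Here is a minimal sketch on a policy document represented as a Python dict; the statement shown is a hypothetical excerpt, since the real contents of OONIDevopsPolicy are not reproduced in this issue, and `replace_action` is an illustrative helper.

```python
import copy


def replace_action(policy, old, new):
    """Return a copy of an IAM policy document with `old` actions replaced by `new`.

    Assumes "Statement" is a list and "Action" is a string or a list of strings.
    """
    fixed = copy.deepcopy(policy)
    for statement in fixed.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            statement["Action"] = new if actions == old else actions
        else:
            statement["Action"] = [new if a == old else a for a in actions]
    return fixed


# Hypothetical excerpt; the real policy statements are not shown here.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["ec2:*", "vpc:*"], "Resource": "*"},
    ],
}

fixed = replace_action(policy, "vpc:*", "vpc-lattice:*")
print(fixed["Statement"][0]["Action"])
```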
tighten up oonidevops_github policy
The `oonidevops-github` IAM user cannot run `terraform plan` due to missing DynamoDB permissions. We should extend the `ooni-devops-github` permissions policy to give the user sufficient access to run terraform actions in the github CI.
Observed here: https://github.com/ooni/devops/actions/runs/8281278038/job/22659415894#step:8:13
Consolidate ansible configuration management into ooni/devops
We should do a big cleanup of the ooni/sysadmin repo, consolidate everything that is still relevant into `ooni/devops`, and then discontinue ooni/sysadmin.
We should make sure nothing breaks in the process.
See future of ooni infrastructure design document.
Create github user for terraform runs
We want to create an IAM user which can be used by GH actions to run terraform (for `plan` and `apply`) in the `dev` environment. This user should have the minimal sufficient permissions to manage AWS resources for our dev infra.
Part of: #6
Point api.dev backend-proxy frontend to backend-test.ooni.org
Currently it's pointing to backend-fsn.ooni.org, it should instead point to backend-test: https://github.com/ooni/devops/blob/main/tf/modules/ooni_backendproxy/templates/setup-backend-proxy.sh#L15
Come up with custom health checks for target groups
Currently, AWS uses a default configuration to run a health check on the `/` path of a service and expects a `200` status code. We should extend this to use the `/healthcheck` path (in the services that provide this) and also improve upon the health check configuration in general.
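For reference, this is roughly what such a check amounts to from the client side: a GET on `/healthcheck` that only counts as healthy on a `200`. It is a sketch for probing a service by hand, not the target group configuration itself; the `is_healthy` helper, its default timeout and the `opener` parameter (there so the function can be exercised without a network) are assumptions.

```python
import urllib.error
import urllib.request


def is_healthy(base_url, path="/healthcheck", timeout=5.0, opener=urllib.request.urlopen):
    """Return True if GET base_url + path answers with HTTP 200."""
    url = base_url.rstrip("/") + path
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts and HTTP error statuses all count as unhealthy.
        return False
```

Calling `is_healthy("http://localhost:8080")` against a locally running service (the port is only an example) gives a quick way to confirm that the service actually exposes `/healthcheck` before pointing the target group at it.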