devops's Introduction

OONI Devops

Infrastructure Tiers

We divide our infrastructure components into 3 tiers:

  • Tier 0: Critical: These are mission-critical infrastructure components. If they become unavailable or suffer significant disruption, the impact is major.

  • Tier 1: Essential: These components are important, but not as critical as tier 0. They are part of our core operations; if they become unavailable, the impact is significant but not major.

  • Tier 2: Non-Essential: These are auxiliary components. Their unavailability does not have a major impact.

Tier 0 (Critical) components

  • Probe Services (collector specifically)
  • Fastpath (part responsible for storing post-cans)
  • DNS configuration
  • Monitoring
  • OONI bridges
  • OONI.org website

Tier 1 (Essential) components

  • OONI API measurement listing
  • OONI Explorer
  • OONI Run
  • OONI Data analysis pipeline
  • Website analytics

Tier 2 (Non-Essential) components

  • Test list editor
  • Jupyter notebooks
  • Countly

devops's Issues

Implement API gateway for api.ooni.io

Basically, we should set up in terraform an API gateway that allows us to route traffic to the new oonidatapi, while keeping most paths pointed at the old legacy instance on backend-fsn.

This would allow us to centrally manage the paths and handle failing over to the old one if something breaks badly.
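A minimal sketch of what this could look like, assuming an HTTP API Gateway with proxy integrations; all names, URIs, and path prefixes below are illustrative assumptions, not the actual setup:

```hcl
# Hypothetical sketch: route selected paths to the new oonidatapi while
# the catch-all route keeps pointing at the legacy backend-fsn instance.
resource "aws_apigatewayv2_api" "ooniapi" {
  name          = "ooniapi-gateway"
  protocol_type = "HTTP"
}

resource "aws_apigatewayv2_integration" "oonidatapi" {
  api_id             = aws_apigatewayv2_api.ooniapi.id
  integration_type   = "HTTP_PROXY"
  integration_method = "ANY"
  integration_uri    = "https://oonidatapi.example.org/{proxy}" # illustrative
}

resource "aws_apigatewayv2_integration" "legacy" {
  api_id             = aws_apigatewayv2_api.ooniapi.id
  integration_type   = "HTTP_PROXY"
  integration_method = "ANY"
  integration_uri    = "https://backend-fsn.ooni.org/{proxy}"
}

# Paths served by the new service are routed explicitly...
resource "aws_apigatewayv2_route" "oonidatapi" {
  api_id    = aws_apigatewayv2_api.ooniapi.id
  route_key = "ANY /api/v2/{proxy+}" # illustrative path prefix
  target    = "integrations/${aws_apigatewayv2_integration.oonidatapi.id}"
}

# ...while everything else falls back to the legacy instance, giving us a
# central place to fail over if something breaks badly.
resource "aws_apigatewayv2_route" "default" {
  api_id    = aws_apigatewayv2_api.ooniapi.id
  route_key = "$default"
  target    = "integrations/${aws_apigatewayv2_integration.legacy.id}"
}
```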

I will drop in here some links to stuff to look at:

Move CodeBuild and CodePipeline setup to terraform

As part of ooni/backend#796 I wrote the CodeBuild and CodePipeline workflows directly in the AWS UI (it was easier to understand what was going on and how they worked from the web interface).

We should at some point move that into the terraform configuration so that it is reproducibly deployed and can easily be extended to other projects without clicking through UIs.
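As a starting point, a hedged sketch of what the CodeBuild side might look like once codified; the project name, role, and build image are illustrative assumptions:

```hcl
# Hypothetical terraform equivalent of a CodeBuild project created in the UI.
resource "aws_codebuild_project" "ooniapi" {
  name         = "ooniapi-build"            # illustrative name
  service_role = aws_iam_role.codebuild.arn # assumed pre-existing role

  artifacts {
    type = "CODEPIPELINE"
  }

  environment {
    compute_type = "BUILD_GENERAL1_SMALL"
    image        = "aws/codebuild/standard:7.0"
    type         = "LINUX_CONTAINER"
  }

  source {
    type      = "CODEPIPELINE"
    buildspec = "buildspec.yml"
  }
}
```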

Optimize load balancer configuration

On AWS you pay for load balancers by the hour, irrespective of how much bandwidth goes over them.

At the moment we are using load balancers and target groups to route traffic to individual microservices. For direct access, which would not be used in production, we should remove the load balancer in front of them and instead communicate directly with the container.

This will likely involve doing some refactoring of the networking setup, since we need to configure direct connectivity to each host.
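For illustration, a sketch of a direct-access service with no load balancer in front; all names here are assumptions, and in practice this also needs security group rules opening the host port:

```hcl
# Hypothetical ECS service reachable directly on the container's host port,
# with no load_balancer block, intended only for non-production access.
resource "aws_ecs_service" "ooniapi_direct" {
  name            = "ooniapi-direct" # illustrative
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.ooniapi.arn
  desired_count   = 1
  launch_type     = "EC2"
  # No load_balancer block: clients talk straight to the host port, so we
  # must configure direct connectivity (DNS + security groups) per host.
}
```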

Consolidate ansible configuration management into ooni/devops

We should do a big cleanup of the ooni/sysadmin repo, consolidate everything that is still relevant into ooni/devops, and then discontinue it.

We should make sure nothing breaks in the process.

See the "future of OONI infrastructure" design document.

Port oohelperd deployment over to new pattern

This is about doing the following:

  • Creating dockerfile for oohelperd, see: ooni/backend#827
  • Creating codebuild/codepipeline buildspec to build and deploy it, see: ooni/backend#827
  • Adding relevant modules to ooni/devops to deploy oohelperd to a new ECS cluster, following the template of how we did it for ooniapi-services. We should copy those modules and customize them to deploy this Go image.

For the last point we should follow the template of:

to create an ECS task definition, build the Dockerfile I wrote above, and deploy the oohelperd test helper to a new ECS cluster.
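A rough sketch of the task definition piece, assuming the image is published to ECR by the pipeline above; names, ports, and sizes are illustrative:

```hcl
# Hypothetical ECS task definition for the oohelperd container.
resource "aws_ecs_task_definition" "oohelperd" {
  family = "oohelperd"
  container_definitions = jsonencode([
    {
      name      = "oohelperd"
      image     = "${aws_ecr_repository.oohelperd.repository_url}:latest"
      memory    = 512
      essential = true
      portMappings = [
        { containerPort = 80, hostPort = 80 } # illustrative port
      ]
    }
  ])
}
```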

Investigate how autoscaling works in our ECS deployments

We would like the autoscaling groups to adjust the number of service instances and EC2 instances to account for increased load or usage.

We should investigate how we have it set up at the moment and evaluate whether the autoscaling group needs tuning in order for it to adjust dynamically.
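For reference, a hedged sketch of what target-tracking scaling of the ECS service's task count could look like; the resource names and the 60% CPU target are assumptions (scaling the underlying EC2 instances would additionally need capacity providers):

```hcl
# Hypothetical target-tracking autoscaling for an ECS service's task count.
resource "aws_appautoscaling_target" "ooniapi" {
  min_capacity       = 1
  max_capacity       = 4
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.ooniapi.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ooniapi_cpu" {
  name               = "ooniapi-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ooniapi.resource_id
  scalable_dimension = aws_appautoscaling_target.ooniapi.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ooniapi.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 60 # scale out when average CPU exceeds ~60%
  }
}
```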

See:

Start outlining the test environment

We should start outlining the structure of the test tier, so we can start migrating services over starting from the testing environment.

As part of this we should refactor the current resources and terraform definitions into separate modules (following how this has been done, for example, for clickhouse: #4).
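The end state would be environments that are thin wrappers around shared modules, along these lines; the module path and inputs are illustrative assumptions:

```hcl
# Hypothetical test-environment entry point consuming a shared module.
module "clickhouse" {
  source      = "../../modules/clickhouse" # assumed module layout
  environment = "test"                     # illustrative input
}
```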

Come up with solution to prevent redeploying too often

Problem:

At the moment whenever something lands in master this triggers a rebuild and deployment of every configured service, including those which are not affected by any changes to the code.

For example, if I make changes only to ooniauth, oonirun is redeployed too. This should not have any real impact and is not a big problem at the moment, since we don't have many services; however, in the future this is problematic for two reasons:

  1. A change to an unrelated service might trigger the deployment of a broken unrelated service.
  2. Even if 1. is mitigated, we are still building a new package and redeploying it for no reason (if our builds were reproducible it would be exactly the same package), costing us CPU cycles.

Possible solutions:
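One possible direction (an assumption, not a chosen solution) is to scope each CodeBuild webhook to its own service's paths, so commits touching unrelated code never trigger a rebuild:

```hcl
# Hypothetical webhook filter: only pushes touching oonirun's directory
# trigger the oonirun build; the path pattern is illustrative.
resource "aws_codebuild_webhook" "oonirun" {
  project_name = aws_codebuild_project.oonirun.name

  filter_group {
    filter {
      type    = "EVENT"
      pattern = "PUSH"
    }
    filter {
      type    = "FILE_PATH"
      pattern = "^ooniapi/services/oonirun/.*"
    }
  }
}
```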

Come up with custom health checks for target groups

Currently, AWS uses a default configuration that runs a health check on the / path of a service and expects a 200 status code. We should extend this to use the /healthcheck path (in the services that provide it) and also improve the health check configuration in general.
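A sketch of what the tightened configuration could look like on a target group; thresholds and intervals are illustrative assumptions:

```hcl
# Hypothetical target group using the service's /healthcheck endpoint.
resource "aws_lb_target_group" "ooniapi" {
  name     = "ooniapi" # illustrative
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/healthcheck" # instead of the default "/"
    matcher             = "200"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}
```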

Come up with list of tiers

Here is a tentative list:

Tier 0 (Critical) components

  • Probe Services (collector specifically)
  • Fastpath (part responsible for storing post-cans)
  • DNS configuration
  • Monitoring
  • OONI bridges
  • OONI.org website

Tier 1 (Essential) components

  • OONI API measurement listing
  • OONI Explorer
  • OONI Run
  • OONI Data analysis pipeline
  • Website analytics

Tier 2 (Non-Essential) components

  • Test list editor
  • Jupyter notebooks
  • Countly

Cleanup CloudHSM setup

Checklist:

  • Cleanup old cluster after verifying that the keys are there (new cluster: cluster-qsvghm4oqok)
  • Cleanup old keys in the new cluster
  • Ensure the networking setup is working properly
  • Setup system for setting up and using the keys
  • Fix script for initializing the codesign box

Optional

  • Make the CloudHSM airgapped and document workflow for using it

We recently did a ceremony to enable our CloudHSM cluster.

During the ceremony we ran into an issue where we mistakenly created additional keys that ended up not being used.

We should do a cleanup of the keys that are in the admin role, but only after having verified that they are not needed.

Moreover, the original cluster on which we did the key generation was not highly available; since you cannot change the availability zones after a cluster is initialized, I had to duplicate the existing cluster and restore it from backup.

We should validate that the restore from backup worked, that all the keys are in there and then delete the old cluster.

We are also charged for every hour an HSM is active, so we should come up with a nicer setup that spins the HSMs up and tears them down on demand instead of paying for idle hours.

Last month we already got an unusually high bill because of this.

I have verified that the keys are preserved as long as you keep the cluster running, and you can just spin up new HSM modules when you need them.
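One way to avoid paying for idle hours, sketched under the assumption (verified above) that the keys survive as long as the cluster exists: keep the cluster permanent but gate the billable HSM instances behind a toggle. Resource names are illustrative:

```hcl
# Hypothetical toggle: the cluster (and the keys it preserves) stays up,
# while the per-hour billed HSM instances are created only when needed.
variable "hsm_enabled" {
  type    = bool
  default = false
}

resource "aws_cloudhsm_v2_hsm" "signing" {
  count      = var.hsm_enabled ? 1 : 0
  cluster_id = aws_cloudhsm_v2_cluster.signing.cluster_id # illustrative name
  subnet_id  = aws_subnet.hsm.id                          # illustrative name
}
```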

Switch to using a secrets manager for storing credentials

As we move forward with this, we are going to have more and more credentials to use as part of CI.

At the moment we are storing credentials inside of ooni/private; however, this doesn't lend itself nicely to CI/CD since it is tied to our personal GPG keys.

We should instead move all the credentials over to a secrets manager and integrate it into the CI/CD process here.

This might be Bitwarden Secrets Manager or even just AWS Secrets Manager.
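Assuming AWS Secrets Manager is picked, a minimal sketch of storing a credential and reading it back during terraform runs; the secret name is illustrative:

```hcl
# Hypothetical secret holding a CI credential.
resource "aws_secretsmanager_secret" "ci_deploy_key" {
  name = "oonidevops/ci/deploy-key" # illustrative name
}

# Terraform (or CI jobs) can then read the current value at plan/apply time.
data "aws_secretsmanager_secret_version" "ci_deploy_key" {
  secret_id = aws_secretsmanager_secret.ci_deploy_key.id
}
```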

Move test helpers from digital ocean to AWS

Test helper rotation script is broken and manual changes were made to DNS to unbrick it on 18th March 2024: https://openobservatory.slack.com/archives/C38EJ0CET/p1710780947922739.

Following this incident the NS delegation of th.ooni.org has been migrated over to AWS, which currently hosts the following A records:
  • 0.th.ooni.org -> 146.190.119.3, 2604:a880:4:1d0::69e:f000
  • 1.th.ooni.org -> 161.35.89.250, 2a03:b0c0:2:d0::1768:9001
  • 2.th.ooni.org -> 161.35.89.250, 2a03:b0c0:2:d0::1768:9001
  • 3.th.ooni.org -> 146.190.119.3, 2604:a880:4:1d0::69e:f000

Note that 1 and 2 point to the same IP, as do 0 and 3, because only two of the running VPSs had not been broken by the auto-rotation script.

Plan for migration

We plan to migrate all these test helpers over to the AWS ECS based configuration, see: https://github.com/ooni/devops/blob/main/tf/environments/prod/main.tf#L505.

All the previous addresses will be configured to point to ALB entry (see: https://github.com/ooni/devops/blob/main/tf/modules/oonith_service/main.tf#L176) for the oonith_service as aliases (effectively it behaves like a CNAME, but costs less: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-choosing-alias-non-alias.html).
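For illustration, the alias record pattern for one of the hostnames could look like this; the zone and resource names are assumptions based on the files linked above:

```hcl
# Hypothetical alias A record pointing 3.th.ooni.org at the oonith ALB.
resource "aws_route53_record" "th3" {
  zone_id = aws_route53_zone.th_ooni_org.zone_id # illustrative zone name
  name    = "3.th.ooni.org"
  type    = "A"

  alias {
    name                   = aws_alb.oonith_service.dns_name
    zone_id                = aws_alb.oonith_service.zone_id
    evaluate_target_health = true
  }
}
```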

Checklist

  • Add support for IPv6 connectivity on test helpers
  • Setup 4.th.ooni.org on AWS (done 19th April 2024)
  • Update check-in to return 4.th.ooni.org (done 22nd April 2024)
  • Drop test helper migration script from backend-fsn (done 22nd April 2024)
  • Drop 3.th.ooni.org from prod/dns_records.tf and have it point to aws_alb.oonith_service: #48
  • Monitor failure rate for 3.th.ooni.org
  • Monitor load of test helper to see if capacity is enough
  • Bump up capacity of machine and ensure that it’s increased with zero downtime
  • Edit backend config to return only 1,2,3,4.th.ooni.org (done 10:45 23rd April 2024 CEST): ooni/backend#838
  • Drop 0.th.ooni.org from prod/dns_records.tf and have it point to aws_alb.oonith_service
  • Monitor failure rate for 0.th.ooni.org
  • Edit backend config to return only 0,3,4.th.ooni.org: ooni/backend#840
  • Drop 1-2.th.ooni.org from prod/dns_records.tf and have it point to aws_alb.oonith_service: #52
  • Monitor failure rate for 1-2.th.ooni.org
  • Delete all test helper related hosts from digital ocean

Create github user for terraform runs

We want to create an IAM user which can be used by GitHub Actions to run terraform (for plan and apply) in the dev environment. This user should have the minimal permissions sufficient to manage AWS resources for our dev infra.
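A hedged sketch of the shape of this; the user name and policy document are illustrative, and the real policy needs to be scoped carefully to the dev resources:

```hcl
# Hypothetical IAM user for GitHub Actions terraform plan/apply runs.
resource "aws_iam_user" "gh_terraform_dev" {
  name = "gh-terraform-dev" # illustrative name
}

resource "aws_iam_user_policy" "gh_terraform_dev" {
  name = "terraform-dev-minimal"
  user = aws_iam_user.gh_terraform_dev.name
  # Assumed least-privilege policy document defined elsewhere.
  policy = data.aws_iam_policy_document.terraform_dev.json
}
```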

Part of: #6
