
Repository for the AKS Landing Zone Accelerator program's Automation reference implementation

License: MIT License


aks-baseline-automation's Introduction

AKS Baseline Automation

This repository demonstrates recommended ways to automate the deployment of the components composing a typical AKS solution.

To manage the complexity of a Kubernetes-based solution deployment, it is best to look at it in terms of a separation of concerns: which team in an enterprise environment should be concerned with which aspect of the deployment, and what tools and processes should that team employ to best achieve its objectives. This implementation and its associated documentation are intended to inform the interdisciplinary teams involved in AKS deployment and lifecycle management automation. These teams may include:

  • The Infrastructure team, responsible for automating the deployment of AKS and the Azure resources that it depends on, such as ACR, Key Vault, Managed Identities, Log Analytics, etc. We provide sample code showing how to implement such automation using Infrastructure as Code (IaC), with a CI/CD pipeline built using GitHub Actions and the option to choose between Bicep and Terraform for the code that deploys these resources.
  • The Networking team, with which the Infrastructure team has to coordinate its activities closely, and which is responsible for all the networking components of the solution, such as VNets, DNS, Application Gateways, etc.
  • The Application team, responsible for automating the deployment of its application services into AKS and managing their release to production using a Blue/Green or Canary approach. We provide sample code and guidance for how these teams can accomplish their goals by packaging their services with Helm and deploying them either through a CI/CD pipeline such as GitHub Actions or through a GitOps tool such as Flux or ArgoCD.
  • The Shared-Services team, responsible for maintaining the overall health of the AKS clusters and the common components that run on them, such as monitoring, networking, security, and other utility services. We provide sample code and guidance for how to bootstrap these services as part of the initial AKS deployment and how to automate their ongoing lifecycle management. These Shared-Services may be AKS add-ons such as AAD Pod Identity or the Secrets Store CSI Driver provider, third-party components such as a Prisma defender or Splunk daemonset, or open-source components such as KEDA, External-DNS, or cert-manager. This team is also responsible for the lifecycle management of the clusters: making sure that updates and upgrades are periodically performed on the cluster, its nodes, and the Shared-Services running in it, and that cluster configuration changes are carried out seamlessly as needed without impacting the applications.
  • The Security team, responsible for making sure that security is built into the pipeline and that all deployed components are secure by default. It also maintains the Azure Policies, NSGs, and firewall rules outside the cluster, as well as all security-related configuration within the AKS cluster, such as Kubernetes Network Policies, RBAC, or authentication and authorization rules within a Service Mesh.

Each team is responsible for maintaining its own automation pipeline. Each pipeline's access to Azure should be granted only through a Service Principal, a Managed Identity, or preferably a Federated Identity, with the minimum set of permissions required to automatically perform the tasks that the team is responsible for.
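For example, a federated identity for a GitHub Actions pipeline can be set up with the Azure CLI. The commands below are a minimal sketch; the application name, repository, role, and scope are placeholders to adapt to each team's needs:

# Create an Azure AD application and service principal for the pipeline
az ad app create --display-name aks-automation-pipeline
az ad sp create --id <appId>

# Federate it with GitHub Actions OIDC so no secret has to be stored in GitHub
az ad app federated-credential create --id <appId> --parameters '{
  "name": "github-main-branch",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:<org>/<repo>:ref:refs/heads/main",
  "audiences": ["api://AzureADTokenExchange"]
}'

# Grant only the role the team needs, scoped as narrowly as possible
az role assignment create --assignee <appId> --role Contributor \
  --scope /subscriptions/<subscriptionId>/resourceGroups/<teamResourceGroup>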

Infrastructure as Code

This section demonstrates the implementation of a CI/CD pipeline built using GitHub Actions to automate the deployment of AKS and of the other Azure resources that AKS depends on. The pipeline deploys an AKS infrastructure similar to v1.24.0.0 of the AKS Baseline Reference Implementation using either Bicep or Terraform modules.

Infrastructure-as-Code

Deploy AKS using GitHub Actions and Bicep

Under the IaC/bicep folder you will find the instructions and the code to deploy the AKS Baseline Reference Implementation through a GitHub Actions pipeline leveraging Bicep CARML modules. The steps can be found here.
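Under the hood, such a pipeline ultimately performs an ARM deployment of the Bicep templates. A rough local equivalent, assuming a subscription-scope entry template and placeholder parameter values, would be:

# Compile and deploy the Bicep entry point at subscription scope
az deployment sub create \
  --location eastus2 \
  --template-file <path-to-main.bicep> \
  --parameters <path-to-parameters-file>

# Confirm the deployment succeeded
az deployment sub show --name <deploymentName> --query properties.provisioningState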

Deploy AKS using GitHub Actions and Terraform (in development)

Under the IaC/terraform folder you will find the instructions and the code to deploy the AKS Baseline Reference Implementation through a GitHub Actions pipeline leveraging CAF Terraform modules. The steps can be found here. This option is still in development.
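For orientation, the Terraform path follows the standard init/plan/apply cycle that such a workflow automates. A minimal local sketch, assuming backend and variable files are already configured:

cd IaC/terraform
terraform init
terraform plan -out tfplan
terraform apply tfplan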

Shared-Services

This section demonstrates the provisioning of the Shared-Services. These services are the in-cluster common components used by all applications running on the cluster. We also provide examples of metrics of interest from these Shared-Services that can be captured and surfaced in a dashboard to help with their maintenance.

In this section we demonstrate two implementation options:

  • A GitOps solution using the AKS Flux add-on. Refer to Shared-Services for instructions on how to set it up so that the Traefik ingress controller gets deployed automatically (a command-line sketch follows the feature list below).
  • A CI/CD pipeline built using GitHub Actions. Refer to this article for an example of a workflow to deploy an NGINX ingress controller.

The GitOps solution features:

  • An opinionated overlay structure that shows separation of concerns and the structure/management of the assets that bootstrap the cluster.
  • Safe deployment practices with GitOps
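As a rough illustration of the Flux add-on approach mentioned above, the GitOps configuration can be created with the Azure CLI. This is a minimal sketch; the repository URL, folder paths, and names are placeholders:

# One-time: make sure the GitOps CLI extensions are installed
az extension add --name k8s-configuration
az extension add --name k8s-extension

# Point the AKS Flux add-on at the folder containing the shared-services manifests
az k8s-configuration flux create \
  --resource-group <rg> \
  --cluster-name <aksCluster> \
  --cluster-type managedClusters \
  --name shared-services \
  --namespace flux-system \
  --scope cluster \
  --url https://github.com/<org>/<repo> \
  --branch main \
  --kustomization name=traefik path=./shared-services prune=true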

Shared-Services Deployment

Note: in a real-world deployment you may want to have a dedicated GitHub repo and ACR instance for Shared-Services to store their artifacts (i.e. manifest files, Helm charts, and container images), separating them from the ones used for IaC and the application workloads. For the sake of simplicity and convenience, we have placed all of those artifacts within this same repo, but in different folders.

Application Deployment

This section demonstrates the deployment of an application composed of multiple services by leveraging two options:

  • A CI/CD pipeline built using Kubernetes GitHub Actions.

  • A GitOps solution using ArgoCD. Note that we could also have used Flux, as we did to deploy the Shared-Services, but the intent is to demonstrate how an app team may choose a separate tool for their specific workload lifecycle concerns, as opposed to using the same tool the cluster operators use for cluster management.

The Flask App sample application is used for this deployment: it is quite simple, yet it demonstrates how to deploy an application composed of multiple containers. In this case the application consists of a web front end written in Python.

Blue/Green and Canary release strategies for this application will also be demonstrated. Note, however, that this feature has not been implemented yet; see issue #27.

Deploy sample applications using GitHub Actions (push method)

Multiple GitHub Actions workflows are used to demonstrate the deployment of sample applications through a CI/CD pipeline (push method). Please click on the links below for instructions on how to use these workflows.

  • Flask Hello World (Docker Build): Builds a container image from code on the runner, then pushes it to ACR. Deployment is done via a push model. Requires self-hosted runners if you deployed a private ACR per the instructions in the IaC section of this repo; to set them up, refer to the Self-hosted GitHub Runners section.
  • Azure Vote (AKS Run Command): A reusable workflow called from the App-Test-All.yml workflow. Deploys the app using a Helm chart through AKS Command Invoke. The focus here is to demonstrate how workloads in private clusters can still be managed from cloud-hosted GitHub runners (no need to install self-hosted runners as in the other samples). It also shows how to test your application using Playwright.
  • Azure Vote (ACR Build): Another reusable workflow called from the App-Test-All.yml workflow. Builds a container image from code directly in Azure Container Registry (ACR). Deployment is done using the Azure Kubernetes GitHub Actions. Requires self-hosted runners if you deployed a private ACR per the instructions in the IaC section of this repo; to set them up, refer to the Self-hosted GitHub Runners section.
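The AKS Run Command and ACR Build scenarios above rely on AKS Command Invoke and on building images directly in ACR. As a rough sketch, the equivalent CLI calls look like the following (registry, cluster, and chart paths are placeholders):

# Build the image inside ACR, so the runner never needs a Docker daemon or
# network line of sight to a private registry
az acr build --registry <acrName> --image azure-vote-front:<tag> <path-to-app-source>

# Deploy to a private cluster from a cloud-hosted runner via AKS Command Invoke;
# --file uploads the chart so it is available to the command running in the cluster
az aks command invoke \
  --resource-group <rg> \
  --name <aksCluster> \
  --file <path-to-chart-directory> \
  --command "helm upgrade --install azure-vote ./<chart-directory> --set image.tag=<tag>"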

Deploy sample applications using GitOps (pull method)

You can use GitOps with Flux or ArgoCD (pull method) as an alternative to GitHub Actions workflows to deploy your applications.

Refer to these instructions for how to set up your environment to deploy a sample application with GitOps using ArgoCD.
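Once ArgoCD is running in the cluster, registering an application is a small step. The sketch below uses the argocd CLI with placeholder repository and path values:

# Register the application, pointing ArgoCD at the manifests in Git
argocd app create sample-app \
  --repo https://github.com/<org>/<repo> \
  --path <path-to-app-manifests> \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace <namespace>

# Trigger an initial sync, then let ArgoCD keep the app reconciled automatically
argocd app sync sample-app
argocd app set sample-app --sync-policy automated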

Lifecycle-Management

Different components of an AKS solution are often owned by different teams and typically follow their own lifecycle management schedule and process, sometimes using different tools. In this section we will cover the following lifecycle management processes:

  • Cluster lifecycle management, such as patching nodes, upgrading AKS, adding/removing node pools, changing the minimum/maximum number of nodes, changing the node pool subnet size, changing the node pool VM SKU, changing max pods, changing labels/taints on nodes, adding/removing pod identities, adding/removing RBAC permissions, etc. (see the command sketch after this list).
  • Workload lifecycle management, such as upgrading one of the services composing the application and releasing it to production using a Blue/Green or Canary approach. External dependencies that the application may have, such as an API Management solution, a Redis cache, or a database, may follow their own lifecycle management process and be operated by a separate team.
  • Shared-Services lifecycle management, such as upgrading one of the Shared-Services container images to address vulnerabilities or to take advantage of new features.
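For illustration, many of the cluster lifecycle operations above map to imperative commands such as the ones below (a minimal sketch with placeholder names). In this reference implementation the intent is for equivalent changes to be expressed in code and applied by a pipeline rather than run by hand:

# Upgrade the control plane and nodes to a newer Kubernetes version
az aks upgrade --resource-group <rg> --name <aksCluster> --kubernetes-version <newVersion>

# Add a node pool with a different VM SKU and labels for a specific workload tier
az aks nodepool add --resource-group <rg> --cluster-name <aksCluster> \
  --name userpool2 --node-vm-size Standard_D4s_v5 --node-count 3 --labels tier=backend

# Adjust the autoscaler bounds on an existing node pool
az aks nodepool update --resource-group <rg> --cluster-name <aksCluster> \
  --name userpool2 --update-cluster-autoscaler --min-count 1 --max-count 5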

For better security and version control, all of these lifecycle management processes should be Git-driven, so that any change to any component of the AKS solution is made through code in a Git repository and goes through a review and approval process. For this reason, we will provide two options to automatically carry out these tasks:

  • A CI/CD pipeline built using GitHub Actions
  • A GitOps solution using Flux or ArgoCD (applies only to Shared-Services and application workload lifecycle management).

Note that these features have not been implemented yet in this reference implementation. For the automation of the cluster lifecycle-management see issue #23.

Secure DevOps

A typical DevOps process for deploying containers to AKS can be depicted by the diagram below: Typical DevOps

The security team's focus is to make sure that security is built into this automation pipeline and that security tasks are shifted left and automated as much as possible. For example, they will need to work with the different automation teams to make sure that the following controls are in place within their pipelines:

Secure DevOps
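As one concrete way of shifting such checks left, scanners like Hadolint, Dockle, and kube-score (all referenced in the DevSecOps issues later in this page) can be run as early pipeline steps. This is a minimal sketch with placeholder image and manifest paths:

# Lint the Dockerfile before building
hadolint Dockerfile

# Check the built image for insecure defaults
dockle <registry>.azurecr.io/<image>:<tag>

# Score the Kubernetes manifests for reliability and security issues
kube-score score <path-to-manifests>/*.yaml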

In addition to this oversight role, they will also have to build and maintain their own pipeline to automate the management of security-related resources outside the clusters (Azure Policies, firewall rules, NSGs, Azure RBAC, etc.) as well as inside the cluster (Network Security Policies, Service Mesh authentication and authorization rules, Kubernetes RBAC, etc.).

Incorporating security controls into the DevOps pipeline is not yet implemented in this reference implementation; see issue #25.

GitHub Repo structure

This repository is organized as follows:

AKS Baseline Automation Repo Structure

Self-hosted GitHub Runners

The default deployment methods in this Reference Implementation use GitHub runners hosted in the GitHub Cloud.

For better security, you may want to set up GitHub self-hosted runners locally within your Azure subscription. For example, if you are using private AKS clusters, you will need self-hosted runners hosted in an Azure VNet with connectivity to your clusters in order to run GitHub Actions workflows that manage those clusters and the workloads running on them.

For more information about the benefits of self-hosted runners, refer to this article. For instructions on how to set up your own self-hosted runners, refer to this article.

The diagram below depicts how a GitHub runner hosted in your Azure subscription uses a Managed Identity to connect securely to your Azure subscription and make changes to your Azure and Kubernetes resources:

GitHub Runners
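On such a runner, authenticating with the VM's managed identity and operating on the cluster comes down to a few commands. This is a sketch with placeholder resource names:

# Sign in with the runner VM's managed identity (no stored credentials)
az login --identity

# Fetch cluster credentials; from inside the VNet the private API server is reachable
az aks get-credentials --resource-group <rg> --name <aksCluster>

# Verify access before the workflow runs its kubectl/helm steps
kubectl get nodes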

Contributing

This project welcomes contributions and suggestions. Please refer to the roadmap for this reference implementation under this repo's Project for the features that are planned. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

aks-baseline-automation's People

Contributors

alexandersehr, bahramr, buchatech, carmodyquinnms, ckittel, dependabot[bot], elyusubov, github-actions[bot], hieumoscow, ibersanoms, janitairfan, joselcaguilar, jpeasier, lastcoolnameleft, mattleach25, microsoft-github-operations[bot], microsoftopensource, mosabami, oliverlabs, rahalan, tjcorr


aks-baseline-automation's Issues

DevSecOps Reference Architecture: Build

Write the 'Build' section of the DevSecOps Reference Architecture.
Topics to consider:

  • SAST: GitHub CodeQL
  • Secret scanning: GitHub secret scanning
  • Package/dependency scanning: GitHub Dependabot
  • Dockerfile scanning: Dockle/Hadolint
  • K8s object scanning for reliability/security: kube-score
  • Defender container/registry scanning
  • Image signing

Azure DevOps support

Great job on creating this repo and getting it to this amazing point!

Not an issue but rather a question.

Any plans on supporting Azure DevOps Pipelines as well?
@joselcaguilar @bahramr

Thanks!

Enhanced GitHub Actions workflows for Terraform

I've been working on some reference workflows for terraform at: https://github.com/tjcorr/tf-pipeline-demo

They add some nice features like:

  • Pausing for manual approvals
  • Using OIDC to eliminate credentials
  • Outputting Terraform plans to PR comments and task summaries
  • A nightly drift detection job that creates issues if anything is detected
  • Terraform unit tests

Would this project be interested in looking at these changes and merging them in? I'm happy to help with a PR but wanted to validate the idea first.

DevSecOps Reference Architecture: Personas overview and responsibilities

Write the 'Personas overview and responsibilities' section of the DevSecOps Reference Architecture:
  • App Developers
    o Write code for business features, build container images, test code
  • Application Operators (SRE)
    o Write Kubernetes YAML for applications
    o Monitor applications
    o Manage app deployments
  • Cluster Operators
    o Manage cluster configuration & implement design
    o Monitor overall cluster health
    o Manage patching/permissions/RBAC
    o IaC
  • Security
    o Build in secure-by-design practices
    o Monitor security issues

Feature Request: As a member of the Shared Services team, I need to know how to automate the deployment of in-cluster shared services using Flux, so that I don’t have to do it manually

Description:
Add the code and document the steps to deploy the shared services using GitOps with Flux. The Shared Services team is responsible for maintaining the overall health of the AKS clusters and the common components that run on them, such as monitoring, networking, security and other utility services. We need to provide sample code and guidance for how to bootstrap these services as part of the initial AKS deployment and also how to automate their ongoing lifecycle management. These shared services may be AKS add-ons such as AAD Pod Identity or the Secrets Store CSI Driver provider, third-party components such as Prisma defender or Splunk daemonsets, or open source such as Traefik, KEDA, External-DNS or cert-manager.

Acceptance Criteria:

  1. Complete code for Flux can be found in the shared-services folder of the repo (aks-baseline-automation/shared-services). At a minimum, deploy the Traefik ingress controller through Flux, i.e. automate these steps: https://github.com/mspnp/aks-baseline/blob/main/09-secret-management-and-ingress-controller.md.
  2. A readme file has been created detailing how to use this deployment pattern (aks-baseline-automation/shared-services).

Feature Request: As a member of the Shared Services team, I need to know how to automate the deployment of in-cluster shared services using GitHub Actions, so that I don’t have to do it manually.

Description:
Add the code and document the steps to deploy the shared services using GitHub Actions. The Shared Services team is responsible for maintaining the overall health of the AKS clusters and the common components that run on them, such as monitoring, networking, security and other utility services. We need to provide sample code and guidance for how to bootstrap these services as part of the initial AKS deployment and also how to automate their ongoing lifecycle management. These shared services may be AKS add-ons such as AAD Pod Identity or the Secrets Store CSI Driver provider, third-party components such as Prisma defender or Splunk daemonsets, or open source such as KEDA, External-DNS or cert-manager. Make it clear what

Acceptance Criteria:

  1. A complete GitHub Actions pipeline can be found in the workflows folder of the repo aks-baseline-automation. At a minimum, deploy the Traefik ingress controller through this workflow pipeline, i.e. automate these steps: https://github.com/mspnp/aks-baseline/blob/main/09-secret-management-and-ingress-controller.md.
  2. Readme file has been created detailing how to use the pipeline here: https://github.com/Azure/aks-baseline-automation/blob/main/shared-services/shared-services-workflow.md

Break apart workflows to reflect real world usage

Currently the workflow deploys an entire stack with a hub, one spoke, one ACR, one AKV and one AKS cluster. This does not reflect real-world usage, where customers will want one or a few hubs and multiple spokes, each of which may have a varying number of AKVs, AKS clusters and ACRs (or the ACRs could live in a shared spoke).
Because we want to reflect real-world usage, we need to break apart the workflows. This issue is intended to break the workflows apart so that there is an overarching workflow that calls the sub-workflows. To reduce maintenance, make sure to remove redundancy so that a change to an individual component only requires editing a single workflow.

The resulting workflows should be

Workflow Name: YAML File Name (please note there is a Terraform and a Bicep version of each)

  • Deploy Full Stack: IaC-bicep-FullStack.yaml and IaC-terraform-FullStack.yaml
  • Deploy Hub: IaC-*-hub.yaml
  • Deploy Spoke: IaC-*-spoke.yaml
  • Deploy ACR: IaC-*-acr.yaml
  • Deploy AKV: IaC-*-akv.yaml
  • Deploy AKS: IaC-*-aks.yaml

(please note we have to figure out how to handle the networking and which part goes where - I'm pretty sure we may end up with some other workflows related to networking).

Feature Request: As a member of the Infrastructure Team responsible for automation, I need to have a Git driven automation process for managing the lifecycle of my clusters so that I don’t have to use the portal or the CLI to do it manually.

Description:
Add the code and document the steps to manage the cluster lifecycle, which includes patching nodes, upgrading AKS, adding/removing node pools, changing the minimum/maximum number of nodes, changing the node pool subnet size, changing the node pool VM SKU, changing max pods, changing labels/taints on nodes, adding/removing pod identities, adding/removing RBAC permissions, etc.

Acceptance Criteria:

  1. Readme file in root directory should detail why using Git for lifecycle management is important
  2. Detailed readme for cluster lifecycle management using Git
  3. Add artifacts required for this to be successful in the lifecycle management folder

PR template

The repo should have a PR template to help contributors avoid raising PRs with empty descriptions.

DevSecOps Reference Architecture: Deploy

Write the 'Deploy' section of the DevSecOps Reference Architecture.

Topics to consider:
  • GitHub secure pipelines/approvals
  • Secure deployment credentials, or GitOps PR/GitHub branch security
    o GitOps
    o Push deployment
  • DAST: OWASP ZAP

Bicep warnings - Environment URLs

The Bicep linter has highlighted these issues:

  1. Warning no-hardcoded-env-urls: Environment URLs should not be hardcoded. Use the environment() function to ensure compatibility across clouds. Found this disallowed host: "management.azure.com"
  2. Warning no-hardcoded-env-urls: Environment URLs should not be hardcoded. Use the environment() function to ensure compatibility across clouds. Found this disallowed host: "login.microsoftonline.com" [https://aka.ms/bicep/linter/no-hardcoded-env-urls]
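To reproduce these warnings locally, building the template with the Azure CLI Bicep tooling prints the linter output to the console (the template path is a placeholder):

az bicep build --file <path-to-main.bicep>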

DevSecOps Reference Architecture: Plan and Design

Write the section 'Plan and Design ' of the DevSecOps Reference Architecture

Design choices / reference to existing areas:

  • Network Security: WAF, NSG, Network Policy, Service Mesh, Private cluster
  • Identity: Auth/Auth, MFA on AAD
  • Secret Storage: Azure Keyvault CSI
  • Threat modeling

Prerequisite document update request

Hi, I was trying this pre-configured AKS setup, and it was throwing an error like this:

The VM size of Standard_B4ms is not allowed in your subscription in location 'eastus2'. 
The available VM sizes are 
'standard_d11,standard_d12,standard_d13,standard_d14,standard_d16lds_v5,standard_d16ls_v5,standard_d16pds_v5,standard_d16plds_v5,standard_d16pls_v5,standard_d16ps_v5,standard_d2,standard_d2lds_v5,standard_d2ls_v5,standard_d2pds_v5,standard_d2plds_v5,standard_d2pls_v5,standard_d2ps_v5,standard_d3,standard_d32lds_v5,standard_d32ls_v5,standard_d32pds_v5,standard_d32plds_v5,standard_d32pls_v5,standard_d32ps_v5,standard_d4,standard_d48lds_v5,standard_d48ls_v5,standard_d48pds_v5,standard_d48plds_v5,standard_d48pls_v5,standard_d48ps_v5,standard_d4lds_v5,standard_d4ls_v5,standard_d4pds_v5,standard_d4plds_v5,standard_d4pls_v5,standard_d4ps_v5,standard_d64lds_v5,standard_d64ls_v5,standard_d64pds_v5,standard_d64plds_v5,standard_d64pls_v5,standard_d64ps_v5,standard_d8lds_v5,standard_d8ls_v5,standard_d8pds_v5,standard_d8plds_v5,standard_d8pls_v5,standard_d8ps_v5,standard_d96lds_v5,standard_d96ls_v5,standard_dc16ds_v3,standard_dc16s_v3,standard_dc24ds_v3,standard_dc24s_v3,standard_dc2as_v5,standard_dc2ds_v3,standard_dc2s_v3,standard_dc32ds_v3,standard_dc32s_v3,standard_dc48ds_v3,standard_dc48s_v3,standard_dc4ds_v3,standard_dc4s_v3,standard_dc8ds_v3,standard_dc8s_v3,standard_ds11,standard_ds12,standard_ds13,standard_ds14,standard_ds2,standard_ds3,standard_ds4,standard_e112iads_v5,standard_e112ias_v5,standard_e16pds_v5,standard_e16ps_v5,standard_e20pds_v5,standard_e20ps_v5,standard_e2pds_v5,standard_e2ps_v5,standard_e32pds_v5,standard_e32ps_v5,standard_e4pds_v5,standard_e4ps_v5,standard_e8pds_v5,standard_e8ps_v5,standard_ec8ads_v5,standard_nc24ads_a100_v4,standard_nc48ads_a100_v4,standard_nc96ads_a100_v4,standard_nv12s_v2,standard_nv24s_v2,standard_nv6s_v2' 

At first, I thought there were problems with main.json or the configuration of the AKS clusters. But after going through multiple deployments and clean-ups, it turned out to be my Microsoft internal subscription, which had limits on multiple VM sizes. I needed to look up main.json's default VM SKUs and match them against the VM SKUs available to me.

And I thought this wouldn't be just my problem; others trying to walk through the CI/CD demo for AKS could hit it too.

What I wanted to suggest is maybe adding extra instructions to the prerequisite README, such as:

Please check the available VM SKUs using these commands before running the auto-generated deployment commands. (And maybe list the VM SKUs used in main.json so that users do not have to look them up themselves and can focus on the demo itself.)

(bash) az vm list-skus --location eastus2 -o table

would definitely help others who stumble upon the same problem as me. (Especially internally at Microsoft, as MCAPS subscriptions seem to have a lot of restrictions.)

DevSecOps Reference Architecture: Develop

  • IDE security plugins, linting tools (CodeQL)
  • Pre-commit hook validation
  • Code reviewers
  • GitHub branch policies
    o Minimum reviewers
    o RBAC/protection/separation of duties
  • Choosing slim/alpine images or ‘distroless’ images

DevSecOps Reference Architecture: Operate & Monitor

Topics to consider:

  • Defender runtime security
  • Aks policy addon / Compliance
    o Container Security profiles, privilege, uuid/guid
  • Continuous monitoring & alerting
    o Azure Monitor
    o Security Center
    o Sentinel
    o Compliance/Policy Center
