This repository contains examples of Azure DevOps (azdo) Pipelines that demonstrate how end-to-end Azure Databricks workspace automation can be done.
The Azure Pipelines can use either Terraform or scripts (with Azure CLI and ARM templates).
The main goal is to have a Databricks Delta pipeline orchestrated with Azure Data Factory while starting from an empty Azure account (with only an empty Subscription and DevOps Organization), all of this by simply running `./run_all.sh`:
- Create the Subscription and DevOps Organization. If using the free tier, request a free Azure DevOps parallelism grant by filling out the following form: https://aka.ms/azpipelines-parallelism-request
- Fork this GitHub repository, as Azure DevOps needs access to it, and changing the Azure Pipelines variables requires committing and pushing changes.
- Customize the variables (see `admin/vars.sh` and `pipelines/vars.yml` below).
- Use the `run_all.sh` script:

```bash
export USE_TERRAFORM="yes"
export AZDO_ORG_SERVICE_URL="https://dev.azure.com/myorg/" # or set it in vars.sh
export AZDO_PERSONAL_ACCESS_TOKEN="xvwepmf76..." # not required if USE_TERRAFORM="no"
export AZDO_GITHUB_SERVICE_CONNECTION_PAT="ghp_9xSDnG..." # GitHub PAT
./run_all.sh
```
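Per the comments above, the Azure DevOps PAT is only needed on the Terraform path, so a script-only (Azure CLI / ARM) run might look like this sketch (the GitHub PAT is still required):

```bash
export USE_TERRAFORM="no"                                  # use the Azure CLI / ARM scripts instead of Terraform
export AZDO_ORG_SERVICE_URL="https://dev.azure.com/myorg/"
export AZDO_GITHUB_SERVICE_CONNECTION_PAT="ghp_9xSDnG..."  # GitHub PAT for the service connection
./run_all.sh
```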
Security was a central part of the design of the main steps of this example and is reflected in the minimum user privileges required for each step:
- step 1: administrator user (`Owner` of the Subscription and `Global administrator` of the Azure AD tenant)
- step 2: infra service principal that is `Owner` of the project Resource Group
- step 3: data service principal that can deploy and run a Data Factory pipeline
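As a quick sanity check before step 1, the `Owner` assignment can be verified with the Azure CLI; a minimal sketch (the `Global administrator` role is easiest to confirm in the Azure AD portal):

```bash
# Resolve the signed-in user and check for an Owner assignment on the current subscription.
UPN=$(az ad signed-in-user show --query userPrincipalName -o tsv)
SUB_ID=$(az account show --query id -o tsv)
az role assignment list --assignee "$UPN" --role Owner \
  --scope "/subscriptions/$SUB_ID" -o table
```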
Builds the Azure core infrastructure (using a privileged user / Administrator):
- this is the foundation for the next step: Resource Groups, Azure DevOps Project and Pipelines, Service Principals, Project group and role assignments.
- the user creating these resources needs to be `Owner` of the Subscription and `Global administrator` of the Active Directory tenant.
- it can be seen as deploying an empty shell for a project or business unit, including the Service Principal (the `Infra SP`) assigned to that project that will have control over the project resources.
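Conceptually, the core of what this setup automates can be sketched with two Azure CLI calls (the names below are hypothetical; the actual resources, groups and role assignments are created by the scripts or Terraform):

```bash
# Create the project's Resource Group (the "empty shell").
az group create --name "rg-databricks-demo" --location "westeurope"

# Create the Infra SP with Owner rights scoped to that Resource Group only.
az ad sp create-for-rbac --name "sp-databricks-demo-infra" \
  --role Owner \
  --scopes "/subscriptions/$SUB_ID/resourceGroups/rg-databricks-demo"
```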
To run this step, use one of the scripts depending on your tool preference:
- Terraform: `./admin/setup-with-terraform.sh` (code)
- Scripts with Azure CLI: `./admin/setup-with-azure-cli.sh` (code)
Before using either, check and personalize the variables in the `admin/vars.sh` file.
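The authoritative variable list lives in the file itself; an illustrative (hypothetical, except for the organization URL mentioned above) excerpt:

```bash
# Illustrative excerpt only -- see admin/vars.sh for the real variables.
export AZDO_ORG_SERVICE_URL="https://dev.azure.com/myorg/"  # Azure DevOps organization URL
export PROJECT_NAME="databricks-demo"                       # hypothetical project name/prefix
export LOCATION="westeurope"                                # hypothetical Azure region
```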
Builds the Azure infrastructure for the data pipeline and project (using the project-specific `Infra SP`):
- this is the Azure infrastructure required to run a Databricks data pipeline, including the Data Lake Gen2 account and containers, Azure Data Factory, the Azure Databricks workspace, and Azure permissions.
- the service principal creating these resources is the `Infra SP` deployed at step 1 (the Resource Group owner).
- it is run as the first stage of the Azure DevOps infra pipeline, whose name is defined in the `AZURE_DEVOPS_INFRA_PIPELINE_NAME` variable.
- there are two Azure Pipelines YAML definitions for this deployment; either one can be used depending on tool preference:
To run this step:
- either use the az cli command as `run_all.sh` does,
- or use the Azure DevOps portal by clicking the `Run pipeline` button on the pipeline whose name is defined in the `AZURE_DEVOPS_INFRA_PIPELINE_NAME` variable.
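For the CLI route, a minimal sketch (assuming the `azure-devops` extension is installed and the pipeline name is exported; the project name here is hypothetical):

```bash
az extension add --name azure-devops   # one-time setup
az pipelines run \
  --name "$AZURE_DEVOPS_INFRA_PIPELINE_NAME" \
  --org "$AZDO_ORG_SERVICE_URL" \
  --project "databricks-demo"          # hypothetical project name
```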
Before using either, check and personalize the variables in the `pipelines/vars.yml` file (don't forget to push any changes to Git before running).
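For example, after editing the variables:

```bash
git add pipelines/vars.yml
git commit -m "Customize pipeline variables"
git push
```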
This step is executed together with the one above, after the Azure infrastructure and the Databricks workspace itself have been deployed:
- it bootstraps the Databricks workspace with the workspace objects required for a new project and pipeline, including Instance Pools, Clusters, Policies, Notebooks, Groups and Service Principals.
- the service principal creating these resources is the `Infra SP` deployed at step 1; it is already a Databricks workspace admin since it deployed the workspace.
- it is run as the second stage of the Azure DevOps infra pipeline, whose name is defined in the `AZURE_DEVOPS_INFRA_PIPELINE_NAME` variable.
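For context on how a service principal like the `Infra SP` talks to the workspace APIs: it can exchange its Azure AD credentials for a Databricks-scoped token. A minimal sketch (the workspace URL is hypothetical; the GUID is the well-known Azure Databricks resource ID):

```bash
# Sign in as the service principal and get a token for the Databricks resource.
az login --service-principal -u "$INFRA_SP_APP_ID" -p "$INFRA_SP_SECRET" --tenant "$TENANT_ID"
TOKEN=$(az account get-access-token \
  --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d --query accessToken -o tsv)

# Example call: list the workspace's instance pools (hypothetical workspace URL).
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/list"
```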
This is run together with the previous step, but if you need to run it separately:
- in the Azure DevOps portal, before clicking the `Run pipeline` button on the Infra pipeline, deselect the `Deploy infrastructure` job.
- with Terraform, use the `terraform/deployments/run-deployment.sh` script.
This step is executed after the infrastructure deployment and workspace bootstrap:
- it's a simple Azure DevOps Pipeline that uses ARM templates to deploy an Azure Data Factory data pipeline together with the Databricks linked service.
- it then invokes the Azure Data Factory data pipeline with the Azure DevOps Pipeline parameters.
- the service principal deploying and running the pipeline is the `Data SP` deployed at step 1; it has the necessary Databricks and Data Factory permissions granted at step 2.
- this service principal also has permission to write data into the Data Lake.
- the Databricks linked service can be of two types:
To run this step:
- either use the az cli command as `run_all.sh` does,
- or use the Azure DevOps portal by clicking the `Run pipeline` button on the pipeline whose name is defined in the `AZURE_DEVOPS_DATA_PIPELINE_NAME` variable.
It will use some of the variables from the `pipelines/vars.yml` file, and it can be customized using pipeline parameters such as database and table names, source data location, etc.
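With the CLI route, those parameters can be passed on the command line; a sketch assuming a recent `azure-devops` extension (which supports `--parameters`) and hypothetical parameter names:

```bash
az pipelines run \
  --name "$AZURE_DEVOPS_DATA_PIPELINE_NAME" \
  --org "$AZDO_ORG_SERVICE_URL" \
  --project "databricks-demo" \
  --parameters database=sales table=orders   # hypothetical parameter names
```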