This repository contains examples of Azure DevOps (azdo) Pipelines that demonstrate how end-to-end Azure Databricks workspace automation can be done.
The Azure Pipelines can use either Terraform or scripts (with Azure CLI and ARM templates).
The main goal is to have a Databricks Delta pipeline orchestrated with Azure Data Factory while starting from an empty Azure account (with only an empty Subscription and DevOps Organization), all of this by simply running `./run_all.sh`:
- Create the Subscription and DevOps Organization. If using the free tier, request a free Azure DevOps parallelism grant by filling out the following form: https://aka.ms/azpipelines-parallelism-request
- Fork this GitHub repository, as Azure DevOps needs access to it, and changing the Azure Pipelines variables requires committing and pushing changes.
- Customize the variables (see `admin/vars.sh` and `pipelines/vars.yml` below).
- Use the `run_all.sh` script:

```bash
export USE_TERRAFORM="yes"
export AZDO_ORG_SERVICE_URL="https://dev.azure.com/myorg/" # or set it in vars.sh
export AZDO_PERSONAL_ACCESS_TOKEN="xvwepmf76..." # not required if USE_TERRAFORM="no"
export AZDO_GITHUB_SERVICE_CONNECTION_PAT="ghp_9xSDnG..." # GitHub PAT
./run_all.sh
```
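Per the comments above, the Azure DevOps PAT is only needed on the Terraform path, so a script-only (Azure CLI / ARM) run might look like this sketch (the GitHub PAT is still required):

```bash
export USE_TERRAFORM="no"                                  # use the Azure CLI / ARM scripts instead of Terraform
export AZDO_ORG_SERVICE_URL="https://dev.azure.com/myorg/"
export AZDO_GITHUB_SERVICE_CONNECTION_PAT="ghp_9xSDnG..."  # GitHub PAT for the service connection
./run_all.sh
```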
Security was a central part of the design of the main steps of this example and is reflected in the minimum user privileges required for each step:
- step 1: administrator user (`Owner` of the Subscription and `Global administrator` of the Azure AD tenant)
- step 2: infra service principal that is `Owner` of the project Resource Group
- step 3: data service principal that can deploy and run a Data Factory pipeline
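As a quick sanity check before step 1, the `Owner` assignment can be verified with the Azure CLI; a minimal sketch (the `Global administrator` role is easiest to confirm in the Azure AD portal):

```bash
# Resolve the signed-in user and check for an Owner assignment on the current subscription.
UPN=$(az ad signed-in-user show --query userPrincipalName -o tsv)
SUB_ID=$(az account show --query id -o tsv)
az role assignment list --assignee "$UPN" --role Owner \
  --scope "/subscriptions/$SUB_ID" -o table
```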
Builds the Azure core infrastructure (using a privileged user / Administrator):
- this is the foundation for the next step: Resource Groups, Azure DevOps Project and Pipelines, Service Principals, Project group and role assignments.
- the user creating these resources needs to be `Owner` of the Subscription and `Global administrator` of the Active Directory tenant.
- it can be seen as deploying an empty shell for a project or business unit, including the Service Principal (the `Infra SP`) assigned to that project that will have control over the project resources.
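Conceptually, the core of what this setup automates can be sketched with two Azure CLI calls (the names below are hypothetical; the actual resources, groups and role assignments are created by the scripts or Terraform):

```bash
# Create the project's Resource Group (the "empty shell").
az group create --name "rg-databricks-demo" --location "westeurope"

# Create the Infra SP with Owner rights scoped to that Resource Group only.
az ad sp create-for-rbac --name "sp-databricks-demo-infra" \
  --role Owner \
  --scopes "/subscriptions/$SUB_ID/resourceGroups/rg-databricks-demo"
```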
To run this step, use one of the scripts depending on your tool preference:
- Terraform: `./admin/setup-with-terraform.sh` (code)
- Scripts with Azure CLI: `./admin/setup-with-azure-cli.sh` (code)
Before using either, check and personalize the variables in the `admin/vars.sh` file.
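The authoritative variable list lives in the file itself; an illustrative (hypothetical, except for the organization URL mentioned above) excerpt:

```bash
# Illustrative excerpt only -- see admin/vars.sh for the real variables.
export AZDO_ORG_SERVICE_URL="https://dev.azure.com/myorg/"  # Azure DevOps organization URL
export PROJECT_NAME="databricks-demo"                       # hypothetical project name/prefix
export LOCATION="westeurope"                                # hypothetical Azure region
```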
Builds the Azure infrastructure for the data pipeline and project (using the project-specific `Infra SP`):
- this is the Azure infrastructure required to run a Databricks data pipeline, including the Data Lake Gen2 account and containers, Azure Data Factory, the Azure Databricks workspace, and Azure permissions.
- the service principal creating these resources is the `Infra SP` deployed at step 1 (the Resource Group owner).
- it is run as the first stage of the Azure DevOps infra pipeline, whose name is defined in the `AZURE_DEVOPS_INFRA_PIPELINE_NAME` variable.
- there are two Azure Pipelines YAML definitions for this deployment; either one can be used depending on tool preference:
To run this step:
- either use the az cli command as `run_all.sh` does,
- or use the Azure DevOps portal by clicking the `Run pipeline` button on the pipeline whose name is defined in the `AZURE_DEVOPS_INFRA_PIPELINE_NAME` variable.
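For the CLI route, a minimal sketch (assuming the `azure-devops` extension is installed and the pipeline name is exported; the project name here is hypothetical):

```bash
az extension add --name azure-devops   # one-time setup
az pipelines run \
  --name "$AZURE_DEVOPS_INFRA_PIPELINE_NAME" \
  --org "$AZDO_ORG_SERVICE_URL" \
  --project "databricks-demo"          # hypothetical project name
```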
Before using either, check and personalize the variables in the `pipelines/vars.yml` file (don't forget to push any changes to Git before running).
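For example, after editing the variables:

```bash
git add pipelines/vars.yml
git commit -m "Customize pipeline variables"
git push
```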
This step is executed together with the one above, after the Azure infrastructure and the Databricks workspace itself have been deployed:
- it bootstraps the Databricks workspace with the workspace objects required for a new project and pipeline, including Instance Pools, Clusters, Policies, Notebooks, Groups and Service Principals.
- the service principal creating these resources is the `Infra SP` deployed at step 1; it is already a Databricks workspace admin since it deployed the workspace.
- it is run as the second stage of the Azure DevOps infra pipeline, whose name is defined in the `AZURE_DEVOPS_INFRA_PIPELINE_NAME` variable.
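For context on how a service principal like the `Infra SP` talks to the workspace APIs: it can exchange its Azure AD credentials for a Databricks-scoped token. A minimal sketch (the workspace URL is hypothetical; the GUID is the well-known Azure Databricks resource ID):

```bash
# Sign in as the service principal and get a token for the Databricks resource.
az login --service-principal -u "$INFRA_SP_APP_ID" -p "$INFRA_SP_SECRET" --tenant "$TENANT_ID"
TOKEN=$(az account get-access-token \
  --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d --query accessToken -o tsv)

# Example call: list the workspace's instance pools (hypothetical workspace URL).
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/list"
```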
This is run together with the previous step, but if you need to run it separately:
- in the Azure DevOps portal, before clicking the `Run pipeline` button on the Infra pipeline, deselect the `Deploy infrastructure` job.
- with Terraform, use the `terraform/deployments/run-deployment.sh` script.
This step is executed after the infrastructure deployment and workspace bootstrap:
- it's a simple Azure DevOps Pipeline that uses ARM templates to deploy an Azure Data Factory data pipeline together with the Databricks linked service.
- it then invokes the Azure Data Factory data pipeline with the Azure DevOps Pipeline parameters.
- the service principal deploying and running the pipeline is the `Data SP` deployed at step 1; it has the necessary Databricks and Data Factory permissions granted at step 2.
- this service principal also has permission to write data into the Data Lake.
- the Databricks linked service can be of two types:
To run this step:
- either use the az cli command as `run_all.sh` does,
- or use the Azure DevOps portal by clicking the `Run pipeline` button on the pipeline whose name is defined in the `AZURE_DEVOPS_DATA_PIPELINE_NAME` variable.
It will use some of the variables from the `pipelines/vars.yml` file, and it can be customized using pipeline parameters such as database and table names, source data location, etc.
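With the CLI route, those parameters can be passed on the command line; a sketch assuming a recent `azure-devops` extension (which supports `--parameters`) and hypothetical parameter names:

```bash
az pipelines run \
  --name "$AZURE_DEVOPS_DATA_PIPELINE_NAME" \
  --org "$AZDO_ORG_SERVICE_URL" \
  --project "databricks-demo" \
  --parameters database=sales table=orders   # hypothetical parameter names
```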