
azdo-databricks


This repository contains examples of Azure DevOps (azdo) Pipelines that demonstrate how end-to-end Azure Databricks workspace automation can be done.

The Azure Pipelines can use either Terraform or scripts (with Azure CLI and ARM templates).

The main goal is to have a Databricks Delta pipeline orchestrated with Azure Data Factory while starting from an empty Azure account (with only an empty Subscription and DevOps Organization):

  • all of this by simply running ./run_all.sh:

(diagram: architecture-pipeline)

Quick start

  1. Create the Subscription and DevOps Organization. If using the free tier, request a free Azure DevOps Parallelism grant by filling out the following form: https://aka.ms/azpipelines-parallelism-request

  2. Fork this GitHub repository, since Azure DevOps needs access to it and changing the Azure Pipelines variables requires committing and pushing changes.

  3. Customize the variables:

    • admin setup variables: edit the admin/vars.sh file (especially SUFFIX and AZURE_DEVOPS_GITHUB_REPO_URL)
    • Azure Pipelines variables: edit the pipelines/vars.yml file, then commit and push the changes
  4. Use the run_all.sh script:

export USE_TERRAFORM="yes"
export AZDO_ORG_SERVICE_URL="https://dev.azure.com/myorg/"    # or set it in vars.sh
export AZDO_PERSONAL_ACCESS_TOKEN="xvwepmf76..."              # not required if USE_TERRAFORM="no"
export AZDO_GITHUB_SERVICE_CONNECTION_PAT="ghp_9xSDnG..."     # GitHub PAT

./run_all.sh

Main steps

Security was a central part of designing the main steps of this example and is reflected in the minimum user privileges required for each step:

  • step 1: administrator user (Owner of Subscription and Global administrator of the AD)
  • step 2: infra service principal that is Owner of the project Resource Group
  • step 3: data service principal that can deploy and run a Data Factory pipeline

Step 1: Azure core infrastructure (admin setup)

(diagram: architecture-admin)

Builds the Azure core infrastructure (using a privileged user / Administrator):

  • this is the foundation for the next step: Resource Groups, Azure DevOps Project and Pipelines, Service Principals, Project group and role assignments.
  • the user creating these resources needs to be Owner of the Subscription and Global administrator of the Active Directory tenant.
  • it can be seen as deploying an empty shell for a project or business unit, including the Service Principal (the Infra SP) assigned to that project, which will have control over the project resources.

To run this step, use one of the scripts depending on your tool preference:

  • Terraform: ./admin/setup-with-terraform.sh (code)
  • Scripts with Azure CLI: ./admin/setup-with-azure-cli.sh (code)

Before using either, check and personalize the variables under the admin/vars.sh file.
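
As a rough sketch, running the Terraform variant could look like the commands below (illustrative only; the full variable list lives in admin/vars.sh and the setup script itself may perform additional checks):

# personalize the admin variables first (at least SUFFIX and AZURE_DEVOPS_GITHUB_REPO_URL)
vi admin/vars.sh

# sign in as the privileged user (Subscription Owner / Active Directory Global administrator)
az login

# run the admin setup with Terraform (or ./admin/setup-with-azure-cli.sh for the scripted variant)
./admin/setup-with-terraform.sh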

Step 2.1: Azure infrastructure for the data pipeline and project

(diagram: architecture-infra)

Builds the Azure infrastructure for the data pipeline and project (using the project-specific Infra SP):

  • this is the Azure infrastructure required to run a Databricks data pipeline, including a Data Lake Gen2 account and containers, Azure Data Factory, an Azure Databricks workspace and Azure permissions.
  • the service principal creating these resources is the Infra SP deployed at step 1 (Resource Group owner).
  • it is run as the first stage in the Azure DevOps infra pipeline with the pipeline name defined in the AZURE_DEVOPS_INFRA_PIPELINE_NAME variable.
  • there are two Azure Pipelines yaml definitions for this deployment and either one can be used depending on the tool preference:

    • Terraform: pipelines/azure-pipelines-infra-with-terraform.yml (code)
    • ARM templates and Azure CLI: pipelines/azure-pipelines-infra-with-azure-cli.yml (code)

To run this step:

  • either use the az cli command, as run_all.sh does.
  • or use the Azure DevOps portal by clicking the Run pipeline button on the pipeline with the name defined in the AZURE_DEVOPS_INFRA_PIPELINE_NAME variable.

Before using either, check and personalize the variables under the pipelines/vars.yml file (don't forget to push any changes to Git before running).
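
For the az cli route, a hedged sketch of triggering the infra pipeline is shown below (it assumes the azure-devops CLI extension is installed, that <my-project> is replaced with your Azure DevOps project name, and that variables such as AZURE_DEVOPS_INFRA_PIPELINE_NAME are already set in the shell; run_all.sh may use a slightly different invocation):

# one-time setup: the pipelines commands live in the azure-devops extension
az extension add --name azure-devops

# queue the infra pipeline (a recent extension version is needed for az pipelines run)
az pipelines run \
  --organization "$AZDO_ORG_SERVICE_URL" \
  --project "<my-project>" \
  --name "$AZURE_DEVOPS_INFRA_PIPELINE_NAME"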

Step 2.2: Databricks workspace bootstrap

(diagram: architecture-infra)

This step is executed together with the one above, after the Azure infrastructure and the Databricks workspace itself have been deployed:

  • it bootstraps the Databricks workspace with the required workspace objects for a new project and pipeline, including Instance Pools, Clusters, Policies, Notebooks, Groups and Service Principals.
  • the service principal creating these resources is the Infra SP deployed at step 1 and is already a Databricks workspace admin since it deployed the workspace.
  • it is run as the second stage in the Azure DevOps infra pipeline with the pipeline name defined in the AZURE_DEVOPS_INFRA_PIPELINE_NAME variable.

This is run together with the previous step, but if you need to run it separately:

  • in the Azure DevOps portal, before clicking the Run pipeline button on the Infra pipeline, deselect the Deploy infrastructure job.
  • with Terraform, use the terraform/deployments/run-deployment.sh script file.
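
A minimal sketch of the Terraform route (the exact arguments and environment the script expects are not shown here; check the script header before running):

# re-run only the workspace bootstrap deployment with Terraform
# (see the comments inside run-deployment.sh for the arguments it expects)
./terraform/deployments/run-deployment.sh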

Step 3: Azure Data Factory data pipeline with Databricks

(diagram: adf-pipeline)

This step is executed after the infrastructure deployment and workspace bootstrap:

  • it's a simple Azure DevOps Pipeline that uses ARM templates to deploy an Azure Data Factory data pipeline together with the Databricks linked service.
  • it then invokes the Azure Data Factory data pipeline with the Azure DevOps Pipeline parameters.
  • the service principal deploying and running the pipeline is the Data SP deployed at step 1 and it has the necessary Databricks and Data Factory permissions given at step 2.
  • this service principal also has the permission to write data into the Data Lake.
  • the Databricks linked service can be of two types:
    • using the Data Factory Managed Identity to authenticate to Databricks: pipelines/azure-pipelines-data-factory-msi.yml (code)
    • using an AAD Access Token of the Data SP to authenticate to Databricks: pipelines/azure-pipelines-data-factory-accesstoken.yml (code)

To run this step:

  • either use the az cli command, as run_all.sh does.
  • or use the Azure DevOps portal by clicking the Run pipeline button on the pipeline with the name defined in the AZURE_DEVOPS_DATA_PIPELINE_NAME variable.

This step uses some of the variables under the pipelines/vars.yml file and can be customized using pipeline parameters such as database and table names, source data location, etc.
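
As an illustrative sketch, a CLI trigger with overridden parameters might look like the following (the parameter names below are placeholders, not the actual names defined in the pipeline YAML; the same azure-devops extension assumptions from step 2.1 apply):

# queue the data pipeline and override a couple of its parameters
az pipelines run \
  --organization "$AZDO_ORG_SERVICE_URL" \
  --project "<my-project>" \
  --name "$AZURE_DEVOPS_DATA_PIPELINE_NAME" \
  --parameters databaseName=demo_db tableName=demo_table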
