azdo-databricks

End-to-end Azure Databricks Workspace automation with Azure Pipelines


This repository contains examples of Azure DevOps (azdo) Pipelines that demonstrate how end-to-end Azure Databricks workspace automation can be done.

The Azure Pipelines can use either Terraform or scripts (with Azure CLI and ARM templates).

The main goal is to have a Databricks Delta pipeline orchestrated with Azure Data Factory, starting from an empty Azure account (with only an empty Subscription and DevOps Organization):

  • all of this by simply running ./run_all.sh:

[diagram: architecture-pipeline]

Quick start

  1. Create the Subscription and DevOps Organization. If using the free tier, request a free Azure DevOps Parallelism grant by filling out the following form: https://aka.ms/azpipelines-parallelism-request

  2. Fork this GitHub repository, since Azure DevOps needs access to it and changing the Azure Pipelines variables requires committing and pushing changes.

  3. Customize the variables:

    • admin setup variables: edit the admin/vars.sh file (especially SUFFIX and AZURE_DEVOPS_GITHUB_REPO_URL); see the sketch after this list
    • Azure Pipelines variables: edit the pipelines/vars.yml file, then commit and push the changes
  4. Use the run_all.sh script:

export USE_TERRAFORM="yes"
export AZDO_ORG_SERVICE_URL="https://dev.azure.com/myorg/"    # or set it in vars.sh
export AZDO_PERSONAL_ACCESS_TOKEN="xvwepmf76..."              # not required if USE_TERRAFORM="no"
export AZDO_GITHUB_SERVICE_CONNECTION_PAT="ghp_9xSDnG..."     # GitHub PAT

./run_all.sh
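For reference, the admin variable edits from step 3 might look like the excerpt below. This is only a sketch: the values are hypothetical and the exact contents of admin/vars.sh may differ, so check the file itself before editing.

# admin/vars.sh (excerpt) - hypothetical values, adjust to your environment
export SUFFIX="demo01"                                                                # short suffix used to make resource names unique
export AZURE_DEVOPS_GITHUB_REPO_URL="https://github.com/<your-user>/azdo-databricks"  # URL of your fork of this repository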

Main steps

Security was a central consideration in designing the main steps of this example, and it is reflected in the minimum user privileges required for each step:

  • step 1: administrator user (Owner of Subscription and Global administrator of the AD)
  • step 2: infra service principal that is Owner of the project Resource Group
  • step 3: data service principal that can deploy and run a Data Factory pipeline

Step 1: Azure core infrastructure (admin setup)

[diagram: architecture-admin]

Builds the Azure core infrastructure (using a privileged user / Administrator):

  • this is the foundation for the next step: Resource Groups, Azure DevOps Project and Pipelines, Service Principals, Project group and role assignments.
  • the user creating these resources needs to be Owner of Subscription and Global administrator of the Active Directory tenant.
  • it can be seen as deploying an empty shell for a project or business unit, including the Service Principal (the Infra SP) assigned to that project, which will have control over the project resources.

To run this step, use one of the scripts, depending on the tool preference:

  • Terraform: ./admin/setup-with-terraform.sh (code)
  • Scripts with Azure CLI: ./admin/setup-with-azure-cli.sh (code)

Before using either, check and personalize the variables under the admin/vars.sh file.
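As a minimal sketch, assuming the administrator user can sign in with the Azure CLI, the Terraform variant could be run like this (the script reads its configuration from admin/vars.sh):

# sign in as the administrator user (Owner of the Subscription, Global administrator of the tenant)
az login

# check and personalize admin/vars.sh first, then run the Terraform-based admin setup
./admin/setup-with-terraform.sh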

Step 2.1: Azure infrastructure for the data pipeline and project

[diagram: architecture-infra]

Builds the Azure infrastructure for the data pipeline and project (using the project specific Infra SP):

  • this is the Azure infrastructure required to run a Databricks data pipeline, including Data Lake Gen 2 account and containers, Azure Data Factory, Azure Databricks workspace and Azure permissions.
  • the service principal creating these resources is the Infra SP deployed at step 1 (Resource Group owner).
  • it is run as the first stage in the Azure DevOps infra pipeline with the pipeline name defined in the AZURE_DEVOPS_INFRA_PIPELINE_NAME variable.
  • there are two Azure Pipelines yaml definitions for this deployment and either one can be used depending on the tool preference:
    • Terraform: pipelines/azure-pipelines-infra-with-terraform.yml (code)
    • ARM templates and Azure CLI: pipelines/azure-pipelines-infra-with-azure-cli.yml (code)

To run this step:

  • either use the az cli command, as run_all.sh does.
  • or use the Azure DevOps portal by clicking the Run pipeline button on the pipeline with the name defined in the AZURE_DEVOPS_INFRA_PIPELINE_NAME variable.

Before using either, check and personalize the variables under the pipelines/vars.yml file (don't forget to push any changes to Git before running).
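For example, triggering the infra pipeline from the command line might look like the sketch below. It assumes the azure-devops extension for the Azure CLI is installed and you are signed in; the organization and project values are placeholders, and run_all.sh may invoke the pipeline slightly differently.

# hypothetical sketch - organization and project values are placeholders;
# assumes the pipeline name is available in the AZURE_DEVOPS_INFRA_PIPELINE_NAME shell variable
az pipelines run \
  --organization "https://dev.azure.com/myorg/" \
  --project "my-project" \
  --name "$AZURE_DEVOPS_INFRA_PIPELINE_NAME"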

Step 2.2: Databricks workspace bootstrap

[diagram: architecture-infra]

This step is executed together with the one above, after the Azure infrastructure and the Databricks workspace itself have been deployed:

  • it bootstraps the Databricks workspace with the required workspace objects for a new project and pipeline, including Instance Pools, Clusters, Policies, Notebooks, Groups and Service Principals.
  • the service principal creating these resources is the Infra SP deployed at step 1 and is already a Databricks workspace admin since it deployed the workspace.
  • it is run as the second stage in the Azure DevOps infra pipeline with the pipeline name defined in the AZURE_DEVOPS_INFRA_PIPELINE_NAME variable.

This step runs together with the previous one, but if it needs to be run separately:

  • in the Azure DevOps portal, before clicking the Run pipeline button on the Infra pipeline, deselect the Deploy infrastructure job.
  • with Terraform, use the terraform/deployments/run-deployment.sh script file.
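As a hedged sketch only (the actual arguments and environment expected by run-deployment.sh are defined by the script itself, so check it before running):

# hypothetical invocation - the deployment name argument is a placeholder and the
# script may expect different arguments or environment variables
cd terraform/deployments
./run-deployment.sh <deployment-name>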

Step 3: Azure Data Factory data pipeline with Databricks

[diagram: adf-pipeline]

This step is executed after the infrastructure deployment and workspace bootstrap:

  • it's a simple Azure DevOps Pipeline that uses ARM templates to deploy an Azure Data Factory data pipeline together with the Databricks linked service.
  • it then invokes the Azure Data Factory data pipeline with the Azure DevOps Pipeline parameters.
  • the service principal deploying and running the pipeline is the Data SP deployed at step 1 and it has the necessary Databricks and Data Factory permissions given at step 2.
  • this service principal also has the permission to write data into the Data Lake.
  • the Databricks linked service can be of two types:
    • using the Data Factory Managed Identity to authenticate to Databricks: pipelines/azure-pipelines-data-factory-msi.yml (code)
    • using an AAD Access Token of the Data SP to authenticate to Databricks: pipelines/azure-pipelines-data-factory-accesstoken.yml (code)

To run this step:

  • either use the az cli command, as run_all.sh does.
  • or use the Azure DevOps portal by clicking the Run pipeline button on the pipeline with the name defined in the AZURE_DEVOPS_DATA_PIPELINE_NAME variable.

It will use some of the variables under the pipelines/vars.yml file and can be customized with pipeline parameters such as database and table names, source data location, etc.
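A rough command-line sketch follows, with the same assumptions as the infra pipeline example above; the parameter names are purely illustrative, since the real ones are defined in the pipelines/azure-pipelines-data-factory-*.yml files.

# hypothetical sketch - organization, project and parameter names are placeholders;
# assumes the pipeline name is available in the AZURE_DEVOPS_DATA_PIPELINE_NAME shell variable
az pipelines run \
  --organization "https://dev.azure.com/myorg/" \
  --project "my-project" \
  --name "$AZURE_DEVOPS_DATA_PIPELINE_NAME" \
  --parameters "databaseName=sales_db" "tableName=orders"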
