microsoft / azure-databricks-nyc-taxi-workshop

An Azure Databricks workshop leveraging the New York Taxi and Limousine Commission Trip Records dataset

License: MIT License

Python 48.18% Scala 51.82%
azure-databricks pyspark azure-machine-learning-services azure-machine-learning scala

Azure Databricks NYC Taxi Workshop

This is a free, multi-part workshop featuring Azure Databricks. It covers the basics of working with Azure Data Services from Spark on Databricks using the Chicago crimes public dataset, followed by an end-to-end data engineering workshop with the NYC Taxi public dataset, and finally an end-to-end machine learning workshop. The workshop is offered in Scala and Python.

The goal of this workshop is to deliver a clear understanding of how to provision Azure data services and how those services integrate with Spark on Azure Databricks, to give you end-to-end experience with basic data engineering and basic data science on Azure Databricks, and to share some boilerplate code to use in your projects.

This is a community contribution, so we appreciate feedback and contributions.

Target Audience

  • Architects
  • Data Engineers
  • Data Scientists

Pre-Requisite Knowledge

  • Prior knowledge of Spark is beneficial
  • Familiarity/experience with Scala/Python

Azure Pre-Requisites

A subscription with at least $200 in credit, to cover 10-14 hours of continuous usage.

1. Module 1 - Primer

This module covers the basics of integrating with Azure Data Services from Spark on Azure Databricks, in batch mode and with structured streaming.

At the end of this module, you will know how to provision, configure, and integrate from Spark with:

  1. Azure Storage - Blob Storage, ADLS Gen1, and ADLS Gen2; includes Databricks Delta as well
  2. Azure Event Hubs - publish and subscribe in batch and with structured streaming; includes Databricks Delta
  3. HDInsight Kafka - publish and subscribe in batch and with structured streaming; includes Databricks Delta
  4. Azure SQL Database - read/write primer in batch and structured streaming
  5. Azure SQL Data Warehouse - read/write primer in batch and structured streaming
  6. Azure Cosmos DB (Core API - SQL API/document-oriented) - read/write primer in batch and structured streaming; includes structured streaming aggregation computation
  7. Azure Data Factory - automating Spark notebooks in Azure Databricks with Azure Data Factory version 2
  8. Azure Key Vault for secrets management

The Chicago crimes dataset is leveraged in the lab.
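
As a rough illustration of the storage items in the list above, the sketch below builds the URI and Spark configuration key that the Hadoop Azure connectors expect for Blob Storage (`wasbs://`) and ADLS Gen2 (`abfss://`). The helper names `blob_uri`, `adls_gen2_uri`, and `account_key_conf` are hypothetical and are not taken from the workshop notebooks; only the URI and key formats are standard.

```python
# Hypothetical helpers (not from the workshop notebooks). They only build
# strings, so they run anywhere; Spark itself is not required here.


def blob_uri(account: str, container: str, path: str = "") -> str:
    """Return a wasbs:// URI for a path inside an Azure Blob Storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def adls_gen2_uri(account: str, filesystem: str, path: str = "") -> str:
    """Return an abfss:// URI for a path inside an ADLS Gen2 filesystem."""
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def account_key_conf(account: str, key: str, gen2: bool = False) -> dict:
    """Return the Spark conf entry that authenticates with a storage account key."""
    endpoint = "dfs" if gen2 else "blob"
    return {f"fs.azure.account.key.{account}.{endpoint}.core.windows.net": key}
```

In a Databricks notebook, each entry would typically be applied with `spark.conf.set(name, value)` before reading from the URI, with the key fetched from a secret scope rather than pasted into the notebook. The account, container, and path names are placeholders you would supply.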

2. Module 2 - Data Engineering Workshop

This is a batch-focused module that covers the building blocks of standing up a data engineering pipeline. The NYC taxi dataset (yellow and green taxi trips) is leveraged in the labs.
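
One building block such a pipeline usually needs is a common schema for the two trip types, since the yellow and green trip records name their timestamp columns differently (`tpep_*` versus `lpep_*`). The sketch below is a plain-Python illustration of that idea and is an assumption, not the workshop's actual transformation rules; the function name `to_canonical` and the canonical field names are hypothetical.

```python
from datetime import datetime

# Timestamp column names as published in the TLC yellow and green trip records.
_TIMESTAMP_COLUMNS = {
    "yellow": ("tpep_pickup_datetime", "tpep_dropoff_datetime"),
    "green": ("lpep_pickup_datetime", "lpep_dropoff_datetime"),
}

_TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_canonical(record: dict, taxi_type: str) -> dict:
    """Map one raw trip record (string values) to a shared, typed schema."""
    pickup_col, dropoff_col = _TIMESTAMP_COLUMNS[taxi_type]
    return {
        "taxi_type": taxi_type,
        "pickup_datetime": datetime.strptime(record[pickup_col], _TIMESTAMP_FORMAT),
        "dropoff_datetime": datetime.strptime(record[dropoff_col], _TIMESTAMP_FORMAT),
        "trip_distance": float(record["trip_distance"]),
    }
```

In the labs the same mapping would be expressed as Spark DataFrame transformations; the dictionary form here only keeps the example self-contained.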

3. Module 3 - Data Science Workshop

There are two versions of the Data Science Workshop. The Scala version shows Spark MLlib models; the PySpark version shows Spark ML and Azure Machine Learning services working together.

If you would like to run Module 3 standalone, you'll need to:

  1. Provision:
    1. Azure Databricks
    2. Azure Storage account
    3. Azure Machine Learning services Workspace
  2. Import the DBC file into the Databricks workspace
  3. Set the module_3_only flag in 99-Shared-Functions-and-Settings to True

The following is a summary of content covered:

  1. Perform feature engineering and feature selection activities
  2. Create an Azure Machine Learning (AML) service workspace
  3. Connect to an AML workspace
  4. Create PySpark models and leverage AML Experiment tracking
  5. Leverage Automated ML capabilities in AML
  6. Deploy the best-performing model as a REST API in a Docker container
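
To make the feature engineering step in the list above concrete, here is a minimal plain-Python sketch that derives a few features from a single taxi trip. The feature names and the function `trip_features` are hypothetical and chosen for illustration; the workshop notebooks do this in Spark and may use different features.

```python
from datetime import datetime


def trip_features(pickup: datetime, dropoff: datetime, distance_miles: float) -> dict:
    """Derive simple model features from one trip's timestamps and distance."""
    duration_min = (dropoff - pickup).total_seconds() / 60.0
    # Guard against zero or negative durations caused by bad records.
    avg_speed_mph = distance_miles / (duration_min / 60.0) if duration_min > 0 else 0.0
    return {
        "pickup_hour": pickup.hour,
        "is_weekend": pickup.weekday() >= 5,  # Monday is 0, so 5 and 6 are Sat/Sun
        "duration_min": duration_min,
        "avg_speed_mph": avg_speed_mph,
    }
```

For example, a 3-mile trip lasting 15 minutes yields `duration_min` of 15.0 and `avg_speed_mph` of 12.0.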

Credits

  • Anagha Khanolkar (Chicago) - creator of workshop, primary author of workshop, content design, all development in Scala, primer module in PySpark
  • Ryan Murphy (St Louis) - contribution to the data engineering workshop transformation rules, schema and more
  • Rajdeep Biswas (Houston) - writing the entire PySpark version of the data engineering lab
  • Steve Howard (St Louis) - contributing to the PySpark version of the data engineering lab
  • Erik Zwiefel (Minneapolis) - content design of data science lab, PySpark version, Azure Machine Learning service integration for operationalization as a REST service, AutoML
  • Thomas Abraham (St Louis) - development of ADFv2 integration primer in PySpark
  • Matt Stenzel, Christopher House (Minneapolis) - testing

azure-databricks-nyc-taxi-workshop's Issues

Create ARM Template for Required Deployments

Create an ARM Template for the required services to make it easier to get up and running.

Many of the potential attendees (especially in the Data Science track) won't provision items directly in Azure.
