
Maximizing Confidence in Your Data Model Changes with dbt and PipeRider

This project was created to accompany the PipeRider + dbt workshop on improving your code review for dbt projects.

This workshop project will run you through the following steps:

PipeRider Walkthrough

  • Initialize PipeRider inside a dbt project
  • Run PipeRider to create a data report
  • Compare data reports
  • Use a compare recipe
  • Define dbt metrics and view in report
  • Compare dbt metrics

Prerequisites

Workshop Steps

1. Initial setup

  1. Fork this repo

  2. Clone your forked repo

    git clone <your-repo-url>
    cd taxi_rides_ny_duckdb
  3. Download the DuckDB database file

    wget https://dtc-workshop.s3.ap-northeast-1.amazonaws.com/nyc_taxi.duckdb
  4. Set up a new venv

    python -m venv ./venv
    source ./venv/bin/activate
  5. Update pip and install the necessary dbt packages and PipeRider

    pip install -U pip
    pip install dbt-core dbt-duckdb 'piperider[duckdb]'
  6. Create a new branch to work on

    git switch -c data-modeling
  7. Install dbt deps and build dbt models

    dbt deps
    dbt build
  8. Initialize PipeRider

    piperider init
  9. Check PipeRider settings

    piperider diagnose

2. Run PipeRider and make data model changes

  1. Run PipeRider

    piperider run

    PipeRider will profile the database and output the path to your data report, e.g.

    Generating reports from: /project/path/.piperider/outputs/latest/run.json
    Report generated in /project/path/.piperider/outputs/latest/index.html
    

    View the HTML report to see the full statistical report of your data source.

  2. Make data model changes (move statistics to their own model)

    a. Create a new model models/core/dm_monthly_zone_statistics.sql

    {{ config(materialized='table') }}

    with trips_data as (
        select * from {{ ref('fact_trips') }}
    )
    select
        -- Revenue grouping
        pickup_zone as revenue_zone,
        date_trunc('month', pickup_datetime) as revenue_month,
        -- Note: for BigQuery use instead: date_trunc(pickup_datetime, month) as revenue_month,

        service_type,

        -- Additional calculations
        count(tripid) as total_monthly_trips,
        avg(passenger_count) as avg_montly_passenger_count,
        avg(trip_distance) as avg_montly_trip_distance

    from trips_data
    group by 1, 2, 3
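To see what this model computes, here is a minimal Python sketch of the same aggregation, using a few made-up fact_trips rows (the real model runs as SQL via dbt; zone and service values here are purely illustrative):

```python
from collections import defaultdict

# Hypothetical fact_trips rows:
# (pickup_zone, revenue_month, service_type, tripid, passenger_count, trip_distance)
trips = [
    ("Astoria", "2020-01", "Green", 1, 1, 2.5),
    ("Astoria", "2020-01", "Green", 2, 3, 4.5),
    ("Harlem", "2020-01", "Yellow", 3, 2, 1.0),
]

# Group by (zone, month, service_type), like "group by 1, 2, 3"
groups = defaultdict(list)
for zone, month, service, tripid, passengers, distance in trips:
    groups[(zone, month, service)].append((passengers, distance))

# count(tripid), avg(passenger_count), avg(trip_distance) per group
stats = {
    key: {
        "total_monthly_trips": len(rows),
        "avg_passenger_count": sum(p for p, _ in rows) / len(rows),
        "avg_trip_distance": sum(d for _, d in rows) / len(rows),
    }
    for key, rows in groups.items()
}

print(stats[("Astoria", "2020-01", "Green")])
# {'total_monthly_trips': 2, 'avg_passenger_count': 2.0, 'avg_trip_distance': 3.5}
```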

    b. Comment out lines 26-28 of models/core/dm_monthly_zone_revenue.sql

    -- Additional calculations
    -- count(tripid) as total_monthly_trips,
    -- avg(passenger_count) as avg_montly_passenger_count,
    -- avg(trip_distance) as avg_montly_trip_distance
  3. Rebuild the dbt models

    dbt build
  4. Run PipeRider again to generate the second data report with the new models

    piperider run
  5. Use the compare-reports function to compare the data profile reports

    piperider compare-reports --last

    The compare-reports command outputs two files:

    • Comparison report: An HTML report comparing the two data profiles
    • Comparison summary: A Markdown file with a summary of changes.

    In a later step, you'll paste the comparison summary Markdown into a pull request (PR) comment.

  6. Commit your changes and push your branch

    git add .
    git commit -m "Added statistics model, updated revenue model"
    git push origin data-modeling
  7. Create a pull request.

    a. Visit your repo on GitHub and click Compare & pull request

    b. Copy the contents of the comparison summary Markdown file into your pull request comment box

    c. Click preview to see how the comparison looks

    d. Click Create pull request to submit your changes

3. PipeRider Compare Recipe

In the above example we used the compare-reports command. PipeRider also has a separate compare command that uses the concept of compare 'recipes'. Recipes are a powerful way to define the specifics of how the compare will run, such as:

  • The branches to compare
  • The datasource/target to compare
  • The dbt commands to run prior to the compare

When PipeRider is initialized, a default compare recipe is created. For our project this looks like:

base:
  branch: main
  dbt:
    commands:
    - dbt deps
    - dbt build
  piperider:
    command: piperider run
target:
  branch: data-modeling
  dbt:
    commands:
    - dbt deps
    - dbt build
  piperider:
    command: piperider run

Run the following command to run the above recipe:

piperider compare

As per the recipe, PipeRider will automatically do the following:

  1. Check out the main branch
  2. Build the models
  3. Run PipeRider
  4. Check out the data-modeling branch
  5. Build the models
  6. Run PipeRider
  7. Compare the data reports of main and data-modeling
  8. Output the compare report and summary

4. dbt-defined Metrics

PipeRider also supports profiling dbt-defined metrics. PipeRider will query dbt metrics and include them in the HTML report.

  1. Edit models/core/schema.yml and add the following code:

    metrics:
      - name: average_distance
        label: Average Distance
        model: ref('fact_trips')
        description: "The average trip distance"
    
        calculation_method: average
        expression: trip_distance
    
        timestamp: pickup_datetime
        time_grains: [month, quarter, year]
    
        tags:
        - piperider

    Important: Don't forget the piperider tag; this is how PipeRider finds and queries your project metrics.

    This defines a new metric on the fact_trips model that calculates the average trip_distance at the time grains of month, quarter, and year.
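Conceptually, the month time grain truncates each pickup_datetime to the start of its month and averages trip_distance within each bucket. A minimal Python sketch with invented sample rows (dbt generates the equivalent SQL for you):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (pickup_datetime, trip_distance) pairs
rows = [
    (datetime(2020, 1, 5, 14, 30), 2.0),
    (datetime(2020, 1, 20, 9, 0), 4.0),
    (datetime(2020, 2, 1, 8, 15), 6.0),
]

# Truncate each timestamp to its month start, like date_trunc('month', ...)
buckets = defaultdict(list)
for ts, distance in rows:
    month_start = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    buckets[month_start].append(distance)

# calculation_method: average, applied per time-grain bucket
average_distance = {m: sum(v) / len(v) for m, v in buckets.items()}
print(average_distance[datetime(2020, 1, 1)])  # 3.0
```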

  2. Run dbt compile

    dbt compile

    Note: As we’re only adding metrics, it is not necessary to build the models again with dbt build.

  3. Run PipeRider to generate a new report.

    piperider run
  4. Check the PipeRider report and click the metrics tab to view the metrics graph.

5. Filter and compare dbt metrics

PipeRider also supports comparing metrics between runs. The comparison is visualized in the comparison report and included in the comparison summary Markdown.

  1. Edit models/core/schema.yml again and add the following filter to the metric definition:

    filters:
      - field: pickup_borough
        operator: '='
        value: "'Manhattan'"
      - field: dropoff_borough
        operator: '='
        value: "'Manhattan'"

    The filter restricts the metric to rows that meet the defined conditions; in this case, both the pickup and dropoff borough must be Manhattan.
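A minimal Python sketch of the filter's effect, using made-up rows: only trips where both boroughs are Manhattan contribute to the average.

```python
# Hypothetical rows: (pickup_borough, dropoff_borough, trip_distance)
trips = [
    ("Manhattan", "Manhattan", 1.0),
    ("Manhattan", "Brooklyn", 5.0),
    ("Queens", "Manhattan", 7.0),
    ("Manhattan", "Manhattan", 3.0),
]

# Keep only rows matching both filter conditions, then average
filtered = [d for pu, do, d in trips if pu == "Manhattan" and do == "Manhattan"]
average_distance = sum(filtered) / len(filtered)
print(average_distance)  # 2.0
```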

    Your modified metric definition should now look like this:

    metrics:
      - name: average_distance
        label: Average Distance
        model: ref('fact_trips')
        description: "The average trip distance"
    
        calculation_method: average
        expression: trip_distance
    
        timestamp: pickup_datetime
        time_grains: [month, quarter, year]
    
        filters:
          - field: pickup_borough
            operator: '='
            value: "'Manhattan'"
          - field: dropoff_borough
            operator: '='
            value: "'Manhattan'"
    
        tags:
        - piperider
  2. Compile your dbt project again.

    dbt compile
  3. Run PipeRider again to generate a report with the new, filtered metrics.

    piperider run
  4. Lastly, run the PipeRider compare-reports command to create a comparison report that includes the two differently defined metrics.

    piperider compare-reports --last
  5. View the newly generated comparison report to see how the metrics compare.

  6. The comparison summary also contains a summary of the metric differences between reports.

PipeRider resources:

  • Learn more about PipeRider in the docs
  • Visit the PipeRider homepage
  • Join the PipeRider Discord for help and discussion
  • Read the PipeRider blog for articles about PipeRider

Contributors

  • daveflynn
