Coder Social home page Coder Social logo

genemerrill / dataform-scd Goto Github PK

View Code? Open in Web Editor NEW

This project forked from dataform-co/dataform-scd

0.0 0.0 0.0 26 KB

Common data models for creating type-2 slowly changing dimensions tables from mutable data sources in Dataform.

Home Page: https://dataform.co

License: MIT License

JavaScript 100.00%

dataform-scd's Introduction

Common data models for creating type-2 slowly changing dimensions tables from mutable data sources in Dataform.

Supported warehouses

  • BigQuery
  • Redshift/PG
  • Snowflake

If you would like us to add support for another warehouse, please get in touch via email or Slack

Installation

Add the package to your package.json file in your Dataform project. You can find the most up to package version on the releases page.

Configure the package

Create a new JS file in your definitions/ folder create an SCD table with the following example:

const scd = require("dataform-scd");

scd("source_data_scd", {
  // A unique identifier for rows in the table.
  uniqueKey: "user_id",
  // A field that stores a timestamp or date of when the row was last changed.
  timestamp: "updated_at",
    // A field that stores the hash value of the fields that we want to track changes in. If you do not want to use the hash comparison, you may omit this field or set it to null
    hash: "hash_value", // OPTIONAL
    // The source table to build slowly changing dimensions from.
    source: {
      schema: "dataform_scd_example",
      name: "source_data",
  },
  // Any configuration parameters to apply to the incremental table that will be created.
  incrementalConfig: {
    bigquery: {
      partitionBy: "updated_at",
    },
  },
});

For more advanced customization of outputs, see the example.js.

Scheduling

Slowly changing dimensions can only by updated as quickly as these models are run. These models should typically be scheduled to run every day or every hour, depending on the granularity of changes you want to capture.

Hash comparison option

Depending on your data update method, you may want to use the hash field option to compare rows on each execution and only add the ones that have been changed or added. To do this, please make sure your table contains a hash field created using the hash function of your choice. You can find a list of the hash functions available in BigQuery here. On each incremental run, the query will compare the hashes for each unique identifier to the ones in the updated table. It will only keep the rows where the hash has changed or where the row ID is not found in the current data.

If you do not want to use the hash comparison, simply omit the hash parameter from the config file or set it to null. If you do this, all rows with an updated timestamp will be added to the {name}_updates table, even if the data did not otherwise change.

Data models

This package will create two relations in the warehouse, for a given name these will be:

  • {name} - a view with scd_valid_from and scd_valid_to fields
  • {name}_updates - an incremental table that stores the change history of the source table

dataform-scd's People

Contributors

lewish avatar polinabee avatar benbirt avatar ekrekr avatar dependabot[bot] avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.