Coder Social home page Coder Social logo

dbt-activity-schema's Introduction

Activity Schema

This dbt package contains macros for building an activity schema.

The macros can help write models for use directly as activity stream tables, particularly if they need to be warehouse-independent.

It follows the v2.0 version of the activity schema specification.

Installation

Check dbt Hub for the latest installation instructions, or read the dbt docs for more information on installing packages.

Include this in your packages.yml

packages:
  - package: narratorai/activity_schema
    version: [">=0.1.0", "<0.2.0"]

Usage

Call the make_activity macro in your models to get the feature_json and activity occurrence columns.

{{ config(features=['subject', 'content']) }}

with final as (
  ...
)

select * from {{ activity_schema.make_activity('final') }}

Warehouse Support

Works on the following warehouses

  • Bigquery
  • Postgres
  • Redshift
  • Snowflake

Macros

make_activity (source) This macro takes a cte name and adds the feature json and activity occurrence columns to it.

To properly add the feature_json column it takes a list of feature column names as a configuration variable.

Example usage

{{ config(features=['tag', 'subject']) }}

with final as (
  select 
    id as activity_id,
    created_at as ts ,

    'received_email' as activity,

    email as customer,
    null as anonymous_customer_id,

    subject,
    preview as content,

    null as link,
    null as revenue_impact
  from emails
)

select * from {{ activity_schema.make_activity('final') }}

feature_json (source)

Helps build a warehouse-independent feature_json column. This works by taking a list of columns containing the feature values and selecting them together into a single json object.

This isn't really necessary if your model only targets a single warehouse. It might be easier to simply write your CTE with a feature_json directly, like so (e.g. for Redshift)

object('subject', subject, 'content', preview ) as feature_json,

activity_occurrence (source) Builds the two activity occurrence columns, activity_occurrence and activity_repeated_at. These are used in the querying of an activity stream to efficiently query Nth and last activities for a given customer.


Resources:

dbt-activity-schema's People

Contributors

cdussud avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

dbt-activity-schema's Issues

Add tiebreakers to Activity Occurrence & Activity Repeated At

With the current code, if the same customer performs the same activity multiple times at the same ts, the activity_occurrence and activity_repeated_at columns do not resolve deterministically and can vary from run to run depending on the DB. I see this happening in my own data on Snowflake when the source system records a date rather than a timestamp that I am then casting as a timestamp for the stream.

This behavior should have ~no impact on results, but does make equality testing between dev and prod environments painful via the audit-helper dbt package or data-diff framework since these two columns can change with each run.

I would expect that adding the activity_id to the order by would resolve this issue without introducing additional complications. Something like:

{# Creates the two activity occurrence columns: activity_occurrence and activity_repeated_at  #}

{% macro activity_occurrence() %}
    row_number() over (
        partition by coalesce (
            {{ safe_cast("customer", type_string()) }},
            {{ safe_cast("anonymous_customer_id", type_string()) }}
            ) order by ts asc, activity_id asc) as activity_occurrence,
    lead(ts) over (
        partition by coalesce (
            {{ safe_cast("customer", type_string()) }},
            {{ safe_cast("anonymous_customer_id", type_string()) }}
        ) order by ts asc, activity_id asc) as activity_repeated_at
{% endmacro %}

What would be the best way for me to support resolving this issue?

activity_occurrence is leaving out activity in partition, why?

Just started diving in to the Activity Schema pattern and am simultaneously using this dbt-activity-schema and the ActivitySchema repo as reference.

But when trying out the dbt macros I couldn't get anything useful out using the window aggregates (activity_occurrence and activity_repeated_at). Or at least not assuming that they should reflect the ordinal ordering of events for a specific customer AND activity.

Just take the pseudo-code for these metrics from the ActivitySchema implementation specification: occurrence-columns calculation. Here the field activity is used in the partitioning.

row_number() over (partition by coalesce (activity, customer, anonymous_customer_id) order by ts asc) as activity_occurrence,
lead(ts) over (partition by coalesce (activity, customer, anonymous_customer_id) order by ts asc) as activity_repeated_at

But in this repo's implementation there is only partitioning done using customer:

# macros/activity_occurrence.sql
...
partition by coalesce (
    {{ safe_cast("customer", type_string()) }},
    {{ safe_cast("anonymous_customer_id", type_string()) }}
) order by ts asc ) as activity_occurrence,
...

So they contradict, and in my head I can't see the value in just having an ordinal numbering of all events for a customer. Is this diff something you did intentionally?

Cannot partition output of make_activity (BigQuery)

I'm following the partition / clustering strategy in the implementation spec for BigQuery, which suggests I partition an activity table by ts:

{ config(
  partition_by={
    "field": "ts",
    "data_type": "timestamp",
    "granularity": "month"
  },
  cluster_by=['activity_occurrence', 'customer']
)}}
...
select * from {{ activity_schema.make_activity('deposited_payroll') }}

However, I'm getting this error:
image

Which I think has to do with this order by statement at the end of the macro: https://github.com/narratorai/dbt-activity-schema/blob/main/macros/make_activity.sql#L23

I think removing that might do the trick, assuming the order by isn't important for other databases?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.