narratorai / dbt-activity-schema Goto Github PK

14.0 2.0 3.0 9 KB

Activity Schema dbt package

License: Apache License 2.0

dbt-activity-schema's Introduction

Activity Schema

This dbt package contains macros for building an activity schema.

The macros can help write models for use directly as activity stream tables, particularly if they need to be warehouse-independent.

It follows the v2.0 version of the activity schema specification.

Installation

Check dbt Hub for the latest installation instructions, or read the dbt docs for more information on installing packages.

Include this in your packages.yml

packages:
  - package: narratorai/activity_schema
    version: [">=0.1.0", "<0.2.0"]

Usage

Call the make_activity macro in your models to get the feature_json and activity occurrence columns.

{{ config(features=['subject', 'content']) }}

with final as (
  ...
)

select * from {{ activity_schema.make_activity('final') }}

Warehouse Support

Works on the following warehouses

Bigquery
Postgres
Redshift
Snowflake

Macros

make_activity (source) This macro takes a cte name and adds the feature json and activity occurrence columns to it.

To properly add the feature_json column it takes a list of feature column names as a configuration variable.

Example usage

{{ config(features=['tag', 'subject']) }}

with final as (
  select 
    id as activity_id,
    created_at as ts ,

    'received_email' as activity,

    email as customer,
    null as anonymous_customer_id,

    subject,
    preview as content,

    null as link,
    null as revenue_impact
  from emails
)

select * from {{ activity_schema.make_activity('final') }}

feature_json (source)

Helps build a warehouse-independent feature_json column. This works by taking a list of columns containing the feature values and selecting them together into a single json object.

This isn't really necessary if your model only targets a single warehouse. It might be easier to simply write your CTE with a feature_json directly, like so (e.g. for Redshift)

object('subject', subject, 'content', preview ) as feature_json,

activity_occurrence (source) Builds the two activity occurrence columns, activity_occurrence and activity_repeated_at. These are used in the querying of an activity stream to efficiently query Nth and last activities for a given customer.

Resources:

Activity schema github page
Discussion on the activity schema dbt slack
Narrator - a service for maintaining & querying activity schemas

dbt-activity-schema's People

Contributors

Stargazers

Watchers

Forkers

suelai wylbee nipunj15

dbt-activity-schema's Issues

Add tiebreakers to Activity Occurrence & Activity Repeated At

With the current code, if the same customer performs the same activity multiple times at the same ts, the activity_occurrence and activity_repeated_at columns do not resolve deterministically and can vary from run to run depending on the DB. I see this happening in my own data on Snowflake when the source system records a date rather than a timestamp that I am then casting as a timestamp for the stream.

This behavior should have ~no impact on results, but does make equality testing between dev and prod environments painful via the audit-helper dbt package or data-diff framework since these two columns can change with each run.

I would expect that adding the activity_id to the order by would resolve this issue without introducing additional complications. Something like:

{# Creates the two activity occurrence columns: activity_occurrence and activity_repeated_at  #}

{% macro activity_occurrence() %}
    row_number() over (
        partition by coalesce (
            {{ safe_cast("customer", type_string()) }},
            {{ safe_cast("anonymous_customer_id", type_string()) }}
            ) order by ts asc, activity_id asc) as activity_occurrence,
    lead(ts) over (
        partition by coalesce (
            {{ safe_cast("customer", type_string()) }},
            {{ safe_cast("anonymous_customer_id", type_string()) }}
        ) order by ts asc, activity_id asc) as activity_repeated_at
{% endmacro %}

What would be the best way for me to support resolving this issue?

activity_occurrence is leaving out activity in partition, why?

Just started diving in to the Activity Schema pattern and am simultaneously using this dbt-activity-schema and the ActivitySchema repo as reference.

But when trying out the dbt macros I couldn't get anything useful out using the window aggregates (activity_occurrence and activity_repeated_at). Or at least not assuming that they should reflect the ordinal ordering of events for a specific customer AND activity.

Just take the pseudo-code for these metrics from the ActivitySchema implementation specification: occurrence-columns calculation. Here the field activity is used in the partitioning.

row_number() over (partition by coalesce (activity, customer, anonymous_customer_id) order by ts asc) as activity_occurrence,
lead(ts) over (partition by coalesce (activity, customer, anonymous_customer_id) order by ts asc) as activity_repeated_at

But in this repo's implementation there is only partitioning done using customer:

# macros/activity_occurrence.sql
...
partition by coalesce (
    {{ safe_cast("customer", type_string()) }},
    {{ safe_cast("anonymous_customer_id", type_string()) }}
) order by ts asc ) as activity_occurrence,
...

So they contradict, and in my head I can't see the value in just having an ordinal numbering of all events for a customer. Is this diff something you did intentionally?

Cannot partition output of make_activity (BigQuery)

I'm following the partition / clustering strategy in the implementation spec for BigQuery, which suggests I partition an activity table by ts:

{ config(
  partition_by={
    "field": "ts",
    "data_type": "timestamp",
    "granularity": "month"
  },
  cluster_by=['activity_occurrence', 'customer']
)}}
...
select * from {{ activity_schema.make_activity('deposited_payroll') }}

However, I'm getting this error:

Which I think has to do with this order by statement at the end of the macro: https://github.com/narratorai/dbt-activity-schema/blob/main/macros/make_activity.sql#L23

I think removing that might do the trick, assuming the order by isn't important for other databases?

narratorai / dbt-activity-schema Goto Github PK

dbt-activity-schema's Introduction

Activity Schema

Installation

Usage

Warehouse Support

Macros

Resources:

dbt-activity-schema's People

Contributors

Stargazers

Watchers

Forkers

dbt-activity-schema's Issues

Add tiebreakers to Activity Occurrence & Activity Repeated At

activity_occurrence is leaving out activity in partition, why?

Cannot partition output of make_activity (BigQuery)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent