narratorai / dbt-activity-schema Goto Github PK
View Code? Open in Web Editor NEWActivity Schema dbt package
License: Apache License 2.0
Activity Schema dbt package
License: Apache License 2.0
With the current code, if the same customer
performs the same activity multiple times at the same ts
, the activity_occurrence
and activity_repeated_at
columns do not resolve deterministically and can vary from run to run depending on the DB. I see this happening in my own data on Snowflake when the source system records a date rather than a timestamp that I am then casting as a timestamp for the stream.
This behavior should have ~no impact on results, but does make equality testing between dev
and prod
environments painful via the audit-helper
dbt package or data-diff
framework since these two columns can change with each run.
I would expect that adding the activity_id
to the order by would resolve this issue without introducing additional complications. Something like:
{# Creates the two activity occurrence columns: activity_occurrence and activity_repeated_at #}
{% macro activity_occurrence() %}
row_number() over (
partition by coalesce (
{{ safe_cast("customer", type_string()) }},
{{ safe_cast("anonymous_customer_id", type_string()) }}
) order by ts asc, activity_id asc) as activity_occurrence,
lead(ts) over (
partition by coalesce (
{{ safe_cast("customer", type_string()) }},
{{ safe_cast("anonymous_customer_id", type_string()) }}
) order by ts asc, activity_id asc) as activity_repeated_at
{% endmacro %}
What would be the best way for me to support resolving this issue?
I'm following the partition / clustering strategy in the implementation spec for BigQuery, which suggests I partition an activity table by ts
:
{ config(
partition_by={
"field": "ts",
"data_type": "timestamp",
"granularity": "month"
},
cluster_by=['activity_occurrence', 'customer']
)}}
...
select * from {{ activity_schema.make_activity('deposited_payroll') }}
However, I'm getting this error:
Which I think has to do with this order by statement at the end of the macro: https://github.com/narratorai/dbt-activity-schema/blob/main/macros/make_activity.sql#L23
I think removing that might do the trick, assuming the order by isn't important for other databases?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.