Coder Social home page Coder Social logo

dbt-activity-schema's People

Contributors

cdussud avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

dbt-activity-schema's Issues

Add tiebreakers to Activity Occurrence & Activity Repeated At

With the current code, if the same customer performs the same activity multiple times at the same ts, the activity_occurrence and activity_repeated_at columns do not resolve deterministically and can vary from run to run depending on the DB. I see this happening in my own data on Snowflake when the source system records a date rather than a timestamp that I am then casting as a timestamp for the stream.

This behavior should have ~no impact on results, but does make equality testing between dev and prod environments painful via the audit-helper dbt package or data-diff framework since these two columns can change with each run.

I would expect that adding the activity_id to the order by would resolve this issue without introducing additional complications. Something like:

{# Creates the two activity occurrence columns: activity_occurrence and activity_repeated_at  #}

{% macro activity_occurrence() %}
    row_number() over (
        partition by coalesce (
            {{ safe_cast("customer", type_string()) }},
            {{ safe_cast("anonymous_customer_id", type_string()) }}
            ) order by ts asc, activity_id asc) as activity_occurrence,
    lead(ts) over (
        partition by coalesce (
            {{ safe_cast("customer", type_string()) }},
            {{ safe_cast("anonymous_customer_id", type_string()) }}
        ) order by ts asc, activity_id asc) as activity_repeated_at
{% endmacro %}

What would be the best way for me to support resolving this issue?

Cannot partition output of make_activity (BigQuery)

I'm following the partition / clustering strategy in the implementation spec for BigQuery, which suggests I partition an activity table by ts:

{ config(
  partition_by={
    "field": "ts",
    "data_type": "timestamp",
    "granularity": "month"
  },
  cluster_by=['activity_occurrence', 'customer']
)}}
...
select * from {{ activity_schema.make_activity('deposited_payroll') }}

However, I'm getting this error:
image

Which I think has to do with this order by statement at the end of the macro: https://github.com/narratorai/dbt-activity-schema/blob/main/macros/make_activity.sql#L23

I think removing that might do the trick, assuming the order by isn't important for other databases?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.