Comments (4)

OwenKephart commented on May 30, 2024

Hi @mjkanji, my suspicion is that much of this stems from a synchronization issue that can crop up when combining "wait for all parents to be updated" with "materialize on this cron schedule".

The first time this condition is evaluated, if the dbt assets have been updated less recently than their parents, they will execute immediately, regardless of whether the Fivetran assets have completed yet. Then, when the Fivetran assets finish materializing, the dbt assets are sent back to the "parents updated more recently" state, so at midnight the next day they are instantly "ready to kick off" again, and so on.
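The loop described above can be illustrated with a small plain-Python timeline. This is only a sketch of the reasoning, not Dagster's implementation: the function name `parent_newer_than_child` and all timestamps are made up for illustration.

```python
from datetime import datetime


def parent_newer_than_child(parent_ts: datetime, child_ts: datetime) -> bool:
    """Simplified stand-in for a 'parents updated more recently' check."""
    return parent_ts > child_ts


# Day 1 (illustrative times): the cron tick fires at midnight, the dbt asset
# runs right away, and the Fivetran sync only finishes afterwards.
dbt_run_day1 = datetime(2024, 5, 29, 0, 1)
fivetran_done_day1 = datetime(2024, 5, 29, 0, 30)

# Day 2, at the midnight tick: yesterday's Fivetran sync is still newer than
# yesterday's dbt run, so the check passes before today's sync has even started.
ready_at_day2_tick = parent_newer_than_child(fivetran_done_day1, dbt_run_day1)
print(ready_at_day2_tick)  # True
```

Because the dbt asset always runs before the sync that follows it, the same comparison holds again on every later day.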

We are working on implementing a "parents updated since latest cron schedule tick" type of rule to solve this issue. It will force the downstream asset to wait until the upstream assets have been materialized after midnight before kicking off, so an upstream materialization from yesterday will not allow the downstream asset to be materialized "ahead of schedule", even if the parent did indeed materialize more recently than the child.
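A minimal sketch of what such a check could look like, written in plain Python rather than against Dagster's API. The function name `parents_updated_since_tick` and the timestamps are assumptions for illustration; the actual rule may be implemented differently.

```python
from datetime import datetime
from typing import Iterable


def parents_updated_since_tick(
    parent_timestamps: Iterable[datetime], latest_tick: datetime
) -> bool:
    """True only if every parent materialized after the latest cron tick."""
    return all(ts > latest_tick for ts in parent_timestamps)


latest_tick = datetime(2024, 5, 30, 0, 0)  # today's midnight tick (illustrative)

# Yesterday's sync does not count, even if it is newer than the child's last run.
stale_parents = [datetime(2024, 5, 29, 0, 30)]
print(parents_updated_since_tick(stale_parents, latest_tick))  # False

# Once today's sync has finished, the downstream asset may start.
fresh_parents = [datetime(2024, 5, 30, 0, 25)]
print(parents_updated_since_tick(fresh_parents, latest_tick))  # True
```

Comparing against the tick time, rather than against the child's last materialization, is what prevents the "ahead of schedule" run.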

from dagster.

mjkanji commented on May 30, 2024

Hi @OwenKephart, thank you for the reply! Is there an ETA for when the new rule will be released?

Additionally, while I think the interaction you mentioned can be part of the story, I don't think it's the entire story. That's because I have, on multiple occasions, run all DBT assets as a single run much later in the day, after the usual cron tick. In this case, the Fivetran assets would have been materialized hours earlier, so when the cron tick arrives the next day, all of the parents would have been updated before the DBT assets were materialized. Yet the behaviour is seemingly the same and, additionally, the DBT assets are split across multiple runs.

There's also another peculiarity in my setup and I'm wondering if that may be causing some of this. The IOManager I'm using for my DBT assets counts the number of rows in the table/view as part of the handle_output method. I'm wondering if this is messing with the order of materialization times for assets.

For example, consider a setup where A -> B -> C and

  • A is a source asset (in DBT parlance) with 1B rows that's orchestrated by Dagster (i.e., outside of DBT).
  • B is a DBT model/view that evaluates to select * from A.
  • C is a DBT model/view that evaluates to select * from B limit 10.

In this setup, the row count operation for C will terminate almost immediately, but counting 1B rows for B will take a while. Would this, in turn, mean that the asset materialization event (and time) for B is later than the materialization time for C, even though DBT would have correctly materialized B before C?

In essence, I'm wondering if the row count operation could cause the same issue that you're identifying with the Fivetran parents, but within the DBT group itself. Or is the materialization time recorded by Dagster dependent on when the dbt run command sends a SUCCESS event, regardless of how long any processing by the IOManager takes afterwards?
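To make the two possibilities in this question concrete, here is a plain-Python model of both. It does not reflect how Dagster actually records materialization times (that is exactly what is being asked); the `Model` class, the `recorded_times` helper, and all durations are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Model:
    name: str
    dbt_finished_at: float    # seconds after run start when dbt finished the model
    row_count_seconds: float  # how long the row count in handle_output takes


def recorded_times(models: List[Model], include_row_count: bool) -> Dict[str, float]:
    """Timestamp per model under one of the two hypotheses."""
    return {
        m.name: m.dbt_finished_at + (m.row_count_seconds if include_row_count else 0.0)
        for m in models
    }


# Made-up durations: B scans ~1B rows, C only 10.
models = [Model("B", 1.0, 600.0), Model("C", 2.0, 0.5)]

# Hypothesis 1: the time is recorded after the row count completes.
after_io = recorded_times(models, include_row_count=True)
print(after_io["B"] > after_io["C"])  # True: B appears newer than its child C

# Hypothesis 2: the time is recorded when dbt reports success.
at_dbt = recorded_times(models, include_row_count=False)
print(at_dbt["B"] > at_dbt["C"])  # False: ordering matches the dbt run
```

Only under the first hypothesis would C look stale relative to B, which is the scenario that could retrigger C.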

mjkanji commented on May 30, 2024

In terms of next steps:

  • Is there anything else you need from my end (e.g., logs, access) that might help you identify the root cause with more certainty?
  • This orchestration issue is preventing me from putting Dagster into production on a long-overdue project for a client, and I really need a short-term solution. What would you recommend to enable the ideal setup I outlined above, where all the assets in the Fivetran-Assets and Data-Pipeline groups are updated first, and then all DBT assets are updated as part of a single run? I'm thinking the Fivetran-Assets and Data-Pipeline groups can remain on their current AMP setup, and I'll need to disable AMP for the DBT assets and use a job instead. If so, how can I ensure the job only starts after all of the parents have been updated? Would this be a job for a Sensor?
    • Finally, what about the assets that are downstream of DBT if I'm using a job? At the moment, there's only one downstream asset in the appwrap group, but it needs to be run monthly instead of daily, and only after the daily DBT run has completed on the first of the month. Would I use a sensor for that as well?

OwenKephart commented on May 30, 2024

Hi @mjkanji -- the new rule will go out in either this week's or next week's release.

In terms of a short-term solution, considering that your specific use case is fairly simple, I think the best option available today would be to use a combination of a schedule (for the Fivetran + data pipeline groups) and a run status sensor (for the dbt assets).

For the assets downstream of the dbt assets, that could also be accomplished with a sensor (which only fires if it's the first of the month).
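The decisions those two sensors would make can be sketched in plain Python. This deliberately avoids Dagster's API: in a real deployment the checks would sit inside sensor bodies (for example, ones defined with Dagster's `run_status_sensor` decorator), and the job name `UPSTREAM_JOB`, the `"SUCCESS"` status string, and both function names are assumptions for illustration.

```python
from datetime import date

UPSTREAM_JOB = "fivetran_and_pipeline_job"  # hypothetical name of the scheduled job


def should_run_dbt(finished_job: str, status: str) -> bool:
    """Kick off the dbt job only after the scheduled upstream job succeeds."""
    return finished_job == UPSTREAM_JOB and status == "SUCCESS"


def should_run_monthly(dbt_status: str, today: date) -> bool:
    """Kick off the monthly asset only after a successful dbt run on the 1st."""
    return dbt_status == "SUCCESS" and today.day == 1


print(should_run_dbt(UPSTREAM_JOB, "SUCCESS"))            # True
print(should_run_dbt(UPSTREAM_JOB, "FAILURE"))            # False
print(should_run_monthly("SUCCESS", date(2024, 6, 1)))    # True
print(should_run_monthly("SUCCESS", date(2024, 6, 2)))    # False
```

Chaining on run status rather than on asset timestamps sidesteps the ordering problem discussed earlier in the thread, at the cost of managing jobs explicitly.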

Definitely interested in getting to the bottom of this issue, though. The strangest part to me here is the fact that the dbt assets are executing immediately upon the cron schedule ticking. Some useful information to help debug this would be a screenshot of the Automation tab on the Asset Details page of one of the triggered assets (preferably one of the root assets of the dbt project).
