Comments (4)
Hi @mjkanji, my suspicion is that much of this may be stemming from a synchronization issue that can crop up when using a combination of "wait for all parents to be updated" and "materialize on this cron schedule".
The first time this condition is evaluated, if the dbt assets have been updated less recently than their parents, then they will execute immediately, regardless of whether the fivetran assets have completed yet. Now, when the fivetran assets finish materializing, the dbt assets get sent back to the "parents updated more recently" state, and so the next day at midnight, they are instantly "ready to kick off" again, and so on.
We are working on implementing a "parents updated since latest cron schedule tick" type of rule to solve this issue, as it will force the downstream asset to wait for the upstream assets to have been materialized after midnight before kicking off (and so an upstream materialization from yesterday will not allow the downstream asset to be materialized "ahead of schedule", even if the parent did indeed materialize more recently than the child).
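To make the difference between the two conditions concrete, here is a toy model in plain Python. It is only a sketch of the logic described above, not Dagster's implementation: the function names and timestamps are hypothetical, and "updated" is reduced to a single last-materialization time per asset.

```python
from datetime import datetime


def parents_newer_than_child(child_last, parents_last):
    """Existing-style check: every parent materialized more recently than the child."""
    return all(p > child_last for p in parents_last)


def parents_updated_since_tick(latest_tick, parents_last):
    """Tick-aware check: every parent materialized at or after the latest cron tick."""
    return all(p >= latest_tick for p in parents_last)


# Yesterday: the dbt asset ran right at the tick; the fivetran parent finished 30 minutes later.
dbt_last = datetime(2024, 1, 1, 0, 0)
fivetran_last = [datetime(2024, 1, 1, 0, 30)]

# Today's midnight tick arrives before today's fivetran sync has run.
tick = datetime(2024, 1, 2, 0, 0)
fires_early = parents_newer_than_child(dbt_last, fivetran_last)
waits = not parents_updated_since_tick(tick, fivetran_last)

# Once today's sync lands, the tick-aware check is satisfied.
fivetran_last = [datetime(2024, 1, 2, 0, 30)]
fires_after_sync = parents_updated_since_tick(tick, fivetran_last)
```

In this model the first check is already true at the tick because of yesterday's late parent run, while the tick-aware check holds the child back until the parent has materialized after midnight.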
from dagster.
Hi @OwenKephart, thank you for the reply! Is there an ETA for when the new rule will be released?
Additionally, while I think the interaction you mentioned can be part of the story, I don't think it's the entire story. That's because I have, on multiple occasions, run all DBT assets much later in the day, after the usual cron tick, as a single run. In this case, the Fivetran assets would have been materialized hours ago, so when the cron tick arrives the next day, the parents would have all been updated before the materialization of the DBT assets. Yet, the behaviour is seemingly the same and, additionally, the DBT assets are split across multiple runs.
There's also another peculiarity in my setup and I'm wondering if that may be causing some of this. The `IOManager` I'm using for my DBT assets counts the number of rows in the table/view as part of the `handle_output` method. I'm wondering if this is messing with the order of materialization times for assets.
For example, consider a setup where `A -> B -> C` and
- A is a source asset (in DBT parlance) with 1B rows that's orchestrated by Dagster (i.e., outside of DBT).
- B is a DBT model/view that evaluates to `select * from A`.
- C is a DBT model/view that evaluates to `select * from B limit 10`.
In this setup, the row count operation for C will terminate almost immediately, but counting 1B rows for B will take a while. Would this, in turn, mean that the asset materialization event (and time) for B is later than the materialization time for C, even though DBT would have correctly materialized B before C?
In essence, I'm wondering if the row count operation could cause the same issue that you're identifying with the Fivetran parents, but within the DBT group itself. Or is the materialization time recorded by Dagster dependent on when the `dbt run` command sends a `SUCCESS` event, regardless of how long any processing by the IOManager takes afterwards?
In terms of next steps:
- Is there anything else you need from my end (e.g., logs, access) that might help you identify the root cause with more certainty?
- This orchestration issue is preventing me from putting Dagster into production on a well-past-overdue project for a client and I really need a short-term solution. What would you recommend doing to enable the ideal setup I outlined above, where all the assets in the Fivetran-Assets and Data-Pipeline groups are updated first, and then all DBT assets are updated as part of a single run? I'm thinking the Fivetran-Assets and Data-Pipeline groups can remain on their current AMP setup, and I'll need to disable AMP for the DBT assets and use a `job` instead. If so, how can I ensure the job only starts after all of the parents have been updated? Would this be a job for a `Sensor`?
- Finally, what about the assets that are downstream of DBT if I'm using a job? At the moment, there's only one downstream asset in the `appwrap` group, but it needs to be run monthly, instead of daily, and only after the daily DBT run has been completed on the first of the month. Would I use a sensor for that as well?
Hi @mjkanji -- the new rule will go out in either this week's or next week's release.
In terms of a short-term solution, considering that your specific use case is fairly simple, I think the current-day solution would be to use a combination of a schedule (for the fivetran + data pipeline groups), and a run status sensor (for the dbt assets).
For the assets downstream of the dbt assets, that could also be accomplished with a sensor (which only fires if it's the first of the month).
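As a rough illustration of the "only fires if it's the first of the month" check, the decision can be kept as a small pure function that a sensor body would call. This is a sketch under assumptions: `should_trigger_monthly` and its arguments are hypothetical names, the Dagster sensor wiring is deliberately left out, and the timestamps are assumed to be in whichever timezone the daily dbt run is scheduled in.

```python
from datetime import datetime
from typing import Optional


def should_trigger_monthly(dbt_run_end: datetime, last_trigger: Optional[datetime]) -> bool:
    """Return True if a finished dbt run should kick off the monthly downstream asset.

    dbt_run_end: when the successful daily dbt run finished (hypothetical input).
    last_trigger: when the monthly asset was last requested, or None if never.
    """
    # Only the daily run that completes on the first of the month qualifies.
    if dbt_run_end.day != 1:
        return False
    # Never triggered before: go ahead.
    if last_trigger is None:
        return True
    # Guard against firing twice in the same month (e.g. after a manual re-run).
    return (last_trigger.year, last_trigger.month) != (dbt_run_end.year, dbt_run_end.month)
```

Keeping the check separate from the sensor makes it easy to unit test, and the same-month guard covers the case where the dbt assets are re-run manually later on the first.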
Definitely interested in getting to the bottom of this issue, though. The strangest part to me here is the fact that the dbt assets are executing immediately upon the cron schedule ticking. Some useful information to help debug this would be a screenshot of the Automation tab on the Asset Details page of one of the triggered assets (preferably one of the root assets of the dbt project).