Comments (7)
Thanks for the submission @cmcnicoll! I think the two things you're comparing might not be exactly the same.
When a `dbt run` is executed with materialized views, a few things happen that lead up to a state-changing command being issued to the server. At a high level, it looks something like this:
- Query the database to get a list of all objects in the target schema along with their types (table, view, materialized view)
- Query the database to build out the dependency tree for ordering nodes in the model graph
- For each model (materialized view in this case):
  - Does the object exist?
    - No > Create the materialized view
    - Yes > Is the existing object a materialized view?
      - No > Drop the existing object then create the materialized view
      - Yes > Both objects are materialized views; do they have the same configuration?
        - No > Apply updates via alter or apply a full refresh (depending on the type of configuration change)
        - Yes > Refresh the materialized view
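That branching can be sketched as a small decision function. This is an illustrative sketch only; the dict shapes and helper names are hypothetical, not dbt's actual internals:

```python
# Illustrative sketch of the per-model decision flow described above.
# The data shapes are hypothetical, not dbt's actual internal API.

def materialize(model, existing):
    """Decide what DDL to issue for a materialized view model.

    `existing` is the cached relation found in the target schema, or None.
    Returns the name of the action dbt would take.
    """
    if existing is None:
        return "create"                      # object does not exist
    if existing["type"] != "materialized_view":
        return "drop_then_create"            # wrong object type in the way
    if existing["config"] != model["config"]:
        # alter vs. full refresh depends on which config changed
        return "alter_or_full_refresh"
    return "refresh"                         # nothing changed: just refresh


print(materialize({"config": {"auto_refresh": False}}, None))
# -> create
print(materialize(
    {"config": {"auto_refresh": False}},
    {"type": "materialized_view", "config": {"auto_refresh": False}},
))
# -> refresh
```

The key point is the last branch: only after all the earlier checks pass does dbt conclude that a bare refresh is sufficient.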
In order for dbt to know that all that's needed is a refresh statement, it needs to do this investigative work first. When you ran the refresh statement in DBeaver, you had already (perhaps subconsciously) observed that your project structure had not changed, nor had the model definition or configuration changed since the last run. Hence executing a refresh statement was the only required step. In other words, for `dbt run` the stopwatch started at the first line in the logic above, whereas for manual execution in DBeaver the stopwatch didn't start until the last line in the logic flow above.
I agree that the 7x-8x jump is larger than I would have expected. I'll talk through my thought process here and see where we wind up. When working with test sets, the bottleneck is often not the size of the data, the local application, or the server; it's the network time in getting your command to the server. I'm going to assume that the network lag is somewhere around the 4s that you're seeing for the refresh command to return, since all the server did was determine that nothing needed to happen. That means a query costs a minimum of ~4s. Walking through the logic flow above, we need to run (at least) the following queries:
- Caching query
- Object dependency query
- Materialized view configuration query
- Refresh query
Let's assume I'm not forgetting anything (not a safe assumption). And I'll point out that these four queries run in series. That puts us at 17.5s (4.371s x 4) just in network lag time, leaving 15s (32.5s - 17.5s) for query execution of the first three queries (assume refresh == 0s). That feels a little higher than expected, but not alarming, especially considering that we have not accounted for any dbt work yet.
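The back-of-the-envelope math above can be checked directly, assuming a flat ~4.371s round trip per query and the four serial queries listed:

```python
# Rough cost model: four serial queries, each paying ~4.371s of round-trip lag.
round_trip = 4.371      # seconds, taken from the observed refresh time
serial_queries = 4      # caching, dependency, MV configuration, refresh
total_runtime = 32.5    # seconds observed for the full dbt run

lag = round_trip * serial_queries
remaining = total_runtime - lag

print(f"network lag:        {lag:.1f}s")   # -> network lag:        17.5s
print(f"left for execution: {remaining:.1f}s")  # -> left for execution: 15.0s
```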
At the moment, I'm not inclined to regard this as a bug. Hopefully this context helps. I'll leave this open in case you disagree though.
from dbt-redshift.
1. Refreshing 200+ incremental materialized views takes approximately the same amount of time as creating the materialized views.
In your post you have `auto_refresh: false`. Have you considered setting this to `true`? This would allow Redshift to refresh on a schedule, which should behave similarly to manually running the refresh statement (minus the network lag).
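For reference, turning that on project-wide might look something like this in `dbt_project.yml` (a sketch; `my_project` is a placeholder model path):

```yaml
# dbt_project.yml (sketch; "my_project" is a placeholder path)
models:
  my_project:
    +materialized: materialized_view
    +auto_refresh: true   # let Redshift refresh on its own schedule
```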
2. Creating/refreshing materialized views do not run concurrently compared to using `materialized="table"`.
This is interesting. I'd like to look into why that's the case, as these should run in parallel if there are no dependencies between the materialized views.
3. Using materialized views for near real-time reporting might not be possible yet, considering that `materialized_view` is a new feature.
Could you elaborate? Is this due to the time it takes to refresh, or something else?
Based on the context you provided, should this issue be changed to a feature request? For example, speeding up the investigative queries and/or being able to bypass them when trying to refresh every 5 minutes.
If 2 is true, I would argue that's a bug. The intention is to run as many models in parallel as we can.
I could see speeding up the investigative queries as a performance enhancement down the road. Ideally that happens at the same time as building the cache. However, since only materialized views support `on_configuration_change` at the moment, we would impact everyone for the benefit of a subset of users. Once we implement `on_configuration_change` for views and tables, then I think it makes sense to move that up in the flow, which would increase performance.
If your intention is to bypass the investigative queries, I strongly suggest looking into turning on `auto_refresh`. Materialized views work more like views than tables. The only reason you would re-run a view in dbt is because the logic or configuration has changed. Hence `dbt run` is effectively a deploy step for views. By comparison, you would run a table whenever you want to deploy OR update the data in the table. The same is true of incremental or snapshot models. A materialized view works more like a view, but the data needs to be updated in it like an incremental model. The difference is that incremental models require a dbt run. The benefit of a materialized view is that you can tell the server to do the refresh step for you using `auto_refresh`. In that context, I would look at `dbt run` for materialized views as a deployment-only step. It's not an efficient way to execute a refresh statement if that's all that's needed (as you've identified here).
Ah ok, the context helps a lot. Agreed that there are some significant limitations on materialized views, and they vary by platform. Based on your context and use cases, would it be fair to reclassify this into two enhancements? Namely:
- Speed up investigative queries (perhaps by tying into the initial cache build)
- Add configuration that would allow dbt to attempt a refresh first, defaulted to "no", and upon failure resume the existing workflow
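The second enhancement could be sketched as a simple try-first strategy. This is pure pseudologic; `run_refresh` and `full_materialization_flow` are hypothetical stand-ins, not real dbt internals:

```python
# Sketch of a "refresh first" strategy: attempt the cheap refresh and
# fall back to the full investigative workflow only if it fails.
# `run_refresh` and `full_materialization_flow` are hypothetical stand-ins.

def deploy(model, refresh_first, run_refresh, full_materialization_flow):
    if refresh_first:
        try:
            # One round trip, no investigative queries.
            return run_refresh(model)
        except Exception:
            # e.g. the object is missing, or isn't a materialized view.
            pass
    return full_materialization_flow(model)


def failing_refresh(model):
    raise RuntimeError("relation not found")


# The refresh fails (object doesn't exist yet), so we fall back.
print(deploy("my_mv", True, failing_refresh, lambda m: f"created {m}"))
# -> created my_mv
```

Defaulting `refresh_first` to off preserves today's behavior while letting steady-state projects skip the investigative round trips.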
Cc: @Fleid
Thanks @mikealfare for the context! I appreciate you taking the time to write that out.
I wasn't sure if this should be a bug, feature or discussion. The main drivers for creating this issue are:
- Refreshing 200+ incremental materialized views takes approximately the same amount of time as creating the materialized views.
- Creating/refreshing materialized views do not run concurrently compared to using `materialized="table"`.
- Using materialized views for near real-time reporting might not be possible yet, considering that `materialized_view` is a new feature.
Based on the context you provided, should this issue be changed to a feature request? For example, speeding up the investigative queries and/or being able to bypass them when trying to refresh every 5 minutes.
Or should this be moved to a discussion? I can expand on my use case, if that helps.
Thanks again @mikealfare for the detailed response! Here is a quick overview of my use case.
- Package config:
  - `+auto_refresh: false`
  - `+backup: true`
  - `+materialized: materialized_view`
- Union identically-structured data sources:
  - `+auto_refresh: true`
  - `+tags: [mv_auto_refresh]`
- Base:
  - Cannot use `auto_refresh` because of this limitation: "Auto-refresh is not supported for materialized views defined on other materialized views."
- Staging/Intermediate/Core:
  - Use an incremental model when query limitations prevent use of an incremental materialized view.
And here are some timings from elementary-data/elementary:
- Create materialized views: `median_execution_time = 22.7`
- Refresh materialized views (`--exclude tag:mv_auto_refresh`): `median_execution_time = 21.2`
- Drop dev schema and try `+materialized: table` instead: `median_execution_time = 8.4`
- Creating/refreshing materialized views do not run concurrently compared to using `materialized="table"`.
Sorry, this is not true. When I check `dbt.log`, the investigative queries are indeed running in parallel. Each materialized view taking ~20s to complete with a ~4s delay between them makes it seem like they are running in series.
- Using materialized views for near real-time reporting might not be possible yet, considering that `materialized_view` is a new feature.
I was hoping that chaining incremental materialized views together would handle slowly changing dimensions and would be faster than incremental models. It seems like the best use case for incremental materialized views is for `source` models with `auto_refresh` on. Speeding up the investigative queries would allow incremental materialized views to compete with the speed of incremental models, which would help my use case.
Yes, that sounds good to me! Thanks!
We are currently running hourly refreshes, but we are still interested in using incremental materialized views to do near real time refreshes. Thanks!