Comments (7)
Thanks for the submission @cmcnicoll! I think the two things you're comparing might not be exactly the same.
When a `dbt run` is executed with materialized views, a few things happen that lead up to a state-changing command being issued to the server. At a high level, it looks something like this:
- Query the database to get a list of all objects in the target schema along with their types (table, view, materialized view)
- Query the database to build out the dependency tree for ordering nodes in the model graph
- For each model (materialized view in this case):
  - Does the object exist?
    - No > Create the materialized view
    - Yes > Is the existing object a materialized view?
      - No > Drop the existing object then create the materialized view
      - Yes > Both objects are materialized views; do they have the same configuration?
        - No > Apply updates via alter or apply a full refresh (depending on the type of configuration change)
        - Yes > Refresh the materialized view
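That branching can be sketched as a small decision function. This is an illustrative sketch only; the dict shapes and helper names are hypothetical, not dbt's actual internals:

```python
# Illustrative sketch of the per-model decision flow described above.
# The data shapes are hypothetical, not dbt's actual internal API.

def materialize(model, existing):
    """Decide what DDL to issue for a materialized view model.

    `existing` is the cached relation found in the target schema, or None.
    Returns the name of the action dbt would take.
    """
    if existing is None:
        return "create"                      # object does not exist
    if existing["type"] != "materialized_view":
        return "drop_then_create"            # wrong object type in the way
    if existing["config"] != model["config"]:
        # alter vs. full refresh depends on which config changed
        return "alter_or_full_refresh"
    return "refresh"                         # nothing changed: just refresh


print(materialize({"config": {"auto_refresh": False}}, None))
# -> create
print(materialize(
    {"config": {"auto_refresh": False}},
    {"type": "materialized_view", "config": {"auto_refresh": False}},
))
# -> refresh
```

The key point is the last branch: only after all the earlier checks pass does dbt conclude that a bare refresh is sufficient.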
In order for dbt to know that all that's needed is a refresh statement, it needs to do this investigative work first. When you ran the refresh statement in DBeaver, you had already (perhaps subconsciously) observed that your project structure had not changed, nor had the model definition or configuration changed since the last run. Hence executing a refresh statement was the only required step. In other words, for `dbt run` the stopwatch started at the first line in the logic above, whereas for manual execution in DBeaver the stopwatch didn't start until the last line in the logic flow above.
I agree that the 7x-8x jump is larger than I would have expected. I'll talk through my thought process here and see where we wind up. When working with test sets, the bottleneck is often not the size of the data, the local application, or the server; it's the network time in getting your command to the server. I'm going to assume that the network lag is somewhere around the 4s that you're seeing for the refresh command to return, since all the server did was determine that nothing needed to happen. That means a query costs a minimum of ~4s. Walking through the logic flow above, we need to run (at least) the following queries:
- Caching query
- Object dependency query
- Materialized view configuration query
- Refresh query
Let's assume I'm not forgetting anything (not a safe assumption). And I'll point out that these four queries run in series. That puts us at 17.5s (4.371s x 4) just in network lag time, leaving 15s (32.5s - 17.5s) for query execution of the first three queries (assume refresh == 0s). That feels a little higher than expected, but not alarming, especially considering that we have not accounted for any dbt work yet.
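The back-of-the-envelope math above can be checked directly, assuming a flat ~4.371s round trip per query and the four serial queries listed:

```python
# Rough cost model: four serial queries, each paying ~4.371s of round-trip lag.
round_trip = 4.371      # seconds, taken from the observed refresh time
serial_queries = 4      # caching, dependency, MV configuration, refresh
total_runtime = 32.5    # seconds observed for the full dbt run

lag = round_trip * serial_queries
remaining = total_runtime - lag

print(f"network lag:        {lag:.1f}s")   # -> network lag:        17.5s
print(f"left for execution: {remaining:.1f}s")  # -> left for execution: 15.0s
```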
At the moment, I'm not inclined to regard this as a bug. Hopefully this context helps. I'll leave this open in case you disagree though.
from dbt-redshift.
1. Refreshing 200+ incremental materialized views takes approximately the same amount of time as creating the materialized views.
In your post you have `auto_refresh: false`. Have you considered setting this to `true`? This would allow Redshift to refresh on a schedule, which should behave similarly to manually running the refresh statement (minus the network lag).
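For reference, turning that on project-wide might look something like this in `dbt_project.yml` (a sketch; `my_project` is a placeholder model path):

```yaml
# dbt_project.yml (sketch; "my_project" is a placeholder path)
models:
  my_project:
    +materialized: materialized_view
    +auto_refresh: true   # let Redshift refresh on its own schedule
```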
2. Creating/refreshing materialized views do not run concurrently compared to using `materialized="table"`.
This is interesting. I'd like to look into why that's the case, as these should run in parallel if there are no dependencies between the materialized views.
3. Using materialized views for near real-time reporting might not be possible yet, considering that `materialized_view` is a new feature.
Could you elaborate? Is this due to the time it takes to refresh, or something else?
Based on the context you provided, should this issue be changed to a feature request? For example, speeding up the investigative queries and/or being able to bypass them when trying to refresh every 5 minutes.
If 2 is true, I would argue that's a bug. The intention is to run as many models in parallel as we can.
I could see speeding up the investigative queries as a performance enhancement down the road. Ideally that happens at the same time as building the cache. However, since only materialized views support `on_configuration_change` at the moment, we would impact everyone for the benefit of a subset of users. Once we implement `on_configuration_change` for views and tables, then I think it makes sense to move that up in the flow, which would increase performance.
If your intention is to bypass the investigative queries, I strongly suggest looking into turning on `auto_refresh`. Materialized views work more like views than tables. The only reason you would re-run a view in dbt is because the logic or configuration has changed. Hence `dbt run` is effectively a deploy step for views. By comparison, you would run a table whenever you want to deploy OR update the data in the table. The same is true of incremental or snapshot models. A materialized view works more like a view, but the data needs to be updated in it like an incremental model. The difference is that incremental models require a dbt run. The benefit of a materialized view is that you can tell the server to do the refresh step for you using `auto_refresh`. In that context, I would look at `dbt run` for materialized views as a deployment-only step. It's not an efficient way to execute a refresh statement if that's all that's needed (as you've identified here).
Ah ok, the context helps a lot. Agreed that there are some significant limitations on materialized views, and they vary by platform. Based on your context and use cases, would it be fair to reclassify this into two enhancements? Namely:
- Speed up investigative queries (perhaps by tying into the initial cache build)
- Add configuration that would allow dbt to attempt a refresh first, defaulted to "no", and upon failure resume the existing workflow
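The second enhancement could be sketched as a simple try-first strategy. This is pure pseudologic; `run_refresh` and `full_materialization_flow` are hypothetical stand-ins, not real dbt internals:

```python
# Sketch of a "refresh first" strategy: attempt the cheap refresh and
# fall back to the full investigative workflow only if it fails.
# `run_refresh` and `full_materialization_flow` are hypothetical stand-ins.

def deploy(model, refresh_first, run_refresh, full_materialization_flow):
    if refresh_first:
        try:
            # One round trip, no investigative queries.
            return run_refresh(model)
        except Exception:
            # e.g. the object is missing, or isn't a materialized view.
            pass
    return full_materialization_flow(model)


def failing_refresh(model):
    raise RuntimeError("relation not found")


# The refresh fails (object doesn't exist yet), so we fall back.
print(deploy("my_mv", True, failing_refresh, lambda m: f"created {m}"))
# -> created my_mv
```

Defaulting `refresh_first` to off preserves today's behavior while letting steady-state projects skip the investigative round trips.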
Cc: @Fleid
Thanks @mikealfare for the context! I appreciate you taking the time to write that out.
I wasn't sure if this should be a bug, feature or discussion. The main drivers for creating this issue are:
- Refreshing 200+ incremental materialized views takes approximately the same amount of time as creating the materialized views.
- Creating/refreshing materialized views do not run concurrently compared to using `materialized="table"`.
- Using materialized views for near real-time reporting might not be possible yet, considering that `materialized_view` is a new feature.
Based on the context you provided, should this issue be changed to a feature request? For example, speeding up the investigative queries and/or being able to bypass them when trying to refresh every 5 minutes.
Or should this be moved to a discussion? I can expand on my use case, if that helps.
Thanks again @mikealfare for the detailed response! Here is a quick overview of my use case.
- Package config:
  - `+auto_refresh: false`
  - `+backup: true`
  - `+materialized: materialized_view`
- Union identically-structured data sources:
  - `+auto_refresh: true`
  - `+tags: [mv_auto_refresh]`
- Base:
  - Cannot use `auto_refresh` because of this limitation: "Auto-refresh is not supported for materialized views defined on other materialized views."
- Staging/Intermediate/Core:
  - Use an incremental model when query limitations prevent use of an incremental materialized view.
And here are some timings from elementary-data/elementary:
- Create materialized views: `median_execution_time = 22.7`
- Refresh materialized views (`--exclude tag:mv_auto_refresh`): `median_execution_time = 21.2`
- Drop dev schema and try `+materialized: table` instead: `median_execution_time = 8.4`
- Creating/refreshing materialized views do not run concurrently compared to using `materialized="table"`.
Sorry, this is not true. When I check `dbt.log`, the investigative queries are indeed running in parallel. Each materialized view taking ~20s to complete with a ~4s delay between them makes it seem like they are running in series.
- Using materialized views for near real-time reporting might not be possible yet, considering that `materialized_view` is a new feature.
I was hoping that chaining incremental materialized views together would handle slowly changing dimensions and would be faster than incremental models. It seems like the best use case for incremental materialized views is for `source` models with `auto_refresh` on. Speeding up the investigative queries would allow incremental materialized views to compete with the speed of incremental models, which would help my use case.
Yes, that sounds good to me! Thanks!
We are currently running hourly refreshes, but we are still interested in using incremental materialized views to do near real time refreshes. Thanks!