Comments (7)

mikealfare commented on July 30, 2024

Thanks for the submission @cmcnicoll! I think the two things you're comparing might not be exactly the same.

When a dbt run is executed with materialized views, a few things happen that lead up to a state changing command being issued to the server. At a high level, it looks something like this:

  • Query the database to get a list of all objects in the target schema along with their types (table, view, materialized view)
  • Query the database to build out the dependency tree for ordering nodes in the model graph
  • For each model (materialized view in this case):
    • Does the object exist?
      • No > Create the materialized view
      • Yes > Is the existing object a materialized view?
        • No > Drop the existing object then create the materialized view
        • Yes > Both objects are materialized views; do they have the same configuration?
          • No > Apply updates via alter or apply a full refresh (depending on the type of configuration change)
          • Yes > Refresh the materialized view
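
The decision flow above can be sketched as a small Python function (the names here are illustrative only, not dbt's actual API; the real logic lives in dbt's materialization macros):

```python
# Hypothetical sketch of dbt's per-model decision flow for materialized views.
# Function and argument names are illustrative, not dbt's actual interface.

def plan_materialized_view(existing_type, config_changed, change_requires_full_refresh):
    """Return the action dbt would take for one materialized view model."""
    if existing_type is None:
        return "create"                      # no object yet: create the MV
    if existing_type != "materialized_view":
        return "drop_and_create"             # wrong relation type: replace it
    if config_changed:
        # alter in place when possible, otherwise rebuild from scratch
        return "full_refresh" if change_requires_full_refresh else "alter"
    return "refresh"                         # nothing changed: just refresh
```

For example, `plan_materialized_view("table", False, False)` returns `"drop_and_create"`, while an unchanged materialized view falls through to `"refresh"`.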

In order for dbt to know that all that's needed is a refresh statement, it needs to do this investigative work first. When you ran the refresh statement in DBeaver, you had already implicitly confirmed that your project structure, model definition, and configuration had not changed since the last run, so executing a refresh statement was the only required step. In other words, for dbt run the stopwatch started at the first line in the logic above, whereas for manual execution in DBeaver the stopwatch didn't start until the last line.

I agree that the 7x-8x jump is larger than I would have expected. I'll talk through my thought process here and see where we wind up. When working with test sets, oftentimes the bottleneck is not the size of the data, the local application, or the server; it's the network time in getting your command to the server. I'm going to assume that the network lag is somewhere around the 4s that you're seeing for the refresh command to return, since all the server did was determine that nothing needed to happen. That means a query costs a minimum of 4s. Walking through the logic flow above, we need to run the following queries (at least):

  • Caching query
  • Object dependency query
  • Materialized view configuration query
  • Refresh query

Let's assume I'm not forgetting anything (not a safe assumption). And I'll point out that these four queries run in series. That puts us at 17.5s (4.371s x 4) just in network lag time, leaving 15s (32.5s - 17.5s) for query execution of the first three queries (assume refresh == 0s). That feels a little higher than expected, but not alarming, especially considering that we have not accounted for any dbt work yet.
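
The back-of-envelope arithmetic above can be checked directly (the timings are the ones reported in this thread; the four-queries-in-series count is the assumption):

```python
# Rough latency budget for a dbt run of one materialized view,
# using the ~4.371s round trip observed for the bare refresh statement.
round_trip = 4.371          # seconds of network lag per query (observed)
queries_in_series = 4       # caching, dependency, configuration, refresh (assumed)
total_run_time = 32.5       # observed dbt run time in seconds

network_lag = round_trip * queries_in_series     # ~17.5s spent on round trips
remaining = total_run_time - network_lag         # ~15s left for query execution
print(f"network lag = {network_lag:.1f}s, remaining = {remaining:.1f}s")
```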

At the moment, I'm not inclined to regard this as a bug. Hopefully this context helps. I'll leave this open in case you disagree though.

from dbt-redshift.

mikealfare commented on July 30, 2024
> 1. Refreshing 200+ incremental materialized views takes approximately the same amount of time as creating the materialized views.

In your post you have auto_refresh: false. Have you considered setting this to true? This would allow Redshift to refresh on a schedule, which should behave similarly to manually running the refresh statement (minus the network lag).

> 2. Creating/refreshing materialized views does not run concurrently, unlike `materialized="table"` models.

This is interesting. I'd like to look into why that's the case as it should run in parallel if there are no dependencies between the materialized views.

> 3. Using materialized views for near real-time reporting might not be possible yet, considering that `materialized_view` is a new feature.

Could you elaborate? Is this due to the time it takes to refresh, or something else?

> Based on the context you provided, should this issue be changed to a feature request? For example, speeding up the investigative queries and/or being able to bypass them when trying to refresh every 5 minutes.

If 2 is true, I would argue that's a bug. The intention is to run as many models in parallel as we can.

I could see speeding up the investigative queries as being a performance enhancement down the road. Ideally that happens at the same time as building the cache. However, since only materialized views support on_configuration_change at the moment, moving that work into the cache build would impact everyone for the benefit of only a subset of users. Once we implement on_configuration_change for views and tables, then I think it makes sense to move that up in the flow, which would improve performance.

If your intention is to bypass the investigative queries, I strongly suggest looking into turning on auto_refresh. Materialized views work more like views than tables. The only reason you would re-run a view in dbt is because the logic or configuration has changed. Hence dbt run is effectively a deploy step for views. By comparison, you would run a table whenever you want to deploy OR update the data in the table. The same is true of incremental or snapshot models. A materialized view works more like a view, but the data needs to be updated in it like an incremental model. The difference is that incremental models require a dbt run; the benefit of a materialized view is that you can tell the server to do the refresh step for you using auto_refresh.

In that context, I would look at dbt run for materialized views as a deployment-only step. It's not an efficient way to execute a refresh statement if that's all that's needed (as you've identified here).


mikealfare commented on July 30, 2024

Ah ok, the context helps a lot. Agreed that there are some significant limitations on materialized views, and they vary by platform. Based on your context and use cases, would it be fair to reclassify this into two enhancements? Namely:

  • Speed up investigative queries (perhaps by tying into the initial cache build)
  • Add configuration that would allow dbt to attempt a refresh first, defaulted to "no", and upon failure resume the existing workflow
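
The second enhancement could look something like this sketch (hypothetical names and interface, not dbt's actual implementation):

```python
# Hypothetical "refresh first" strategy: skip the investigative queries,
# attempt a bare refresh, and fall back to the full workflow if it fails
# (e.g. the object doesn't exist or isn't a materialized view).

def run_materialized_view(execute_sql, full_workflow, relation, refresh_first=False):
    if refresh_first:
        try:
            # One round trip instead of four when nothing has changed.
            execute_sql(f"REFRESH MATERIALIZED VIEW {relation}")
            return "refreshed"
        except Exception:
            pass  # fall through to the existing create/alter/refresh logic
    return full_workflow(relation)
```

With `refresh_first` defaulted to `False`, existing behavior is unchanged; opting in trades the investigative queries for one optimistic refresh attempt.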

Cc: @Fleid


cmcnicoll commented on July 30, 2024

Thanks @mikealfare for the context! I appreciate you taking the time to write that out.

I wasn't sure if this should be a bug, feature or discussion. The main drivers for creating this issue are:

  1. Refreshing 200+ incremental materialized views takes approximately the same amount of time as creating the materialized views.
  2. Creating/refreshing materialized views does not run concurrently, unlike `materialized="table"` models.
  3. Using materialized views for near real-time reporting might not be possible yet, considering that materialized_view is a new feature.

Based on the context you provided, should this issue be changed to a feature request? For example, speeding up the investigative queries and/or being able to bypass them when trying to refresh every 5 minutes.

Or should this be moved to a discussion? I can expand on my use case, if that helps.


cmcnicoll commented on July 30, 2024

Thanks again @mikealfare for the detailed response! Here is a quick overview of my use case.

  • Package config:

    ```yaml
    +auto_refresh: false
    +backup: true
    +materialized: materialized_view
    ```

  • Union identically-structured data sources:

    ```yaml
    +auto_refresh: true
    +tags:
      - mv_auto_refresh
    ```

    (Auto-refresh is not supported for materialized views defined on other materialized views.)

  • Staging/Intermediate/Core:
    • Use an incremental model when query limitations prevent use of an incremental materialized view.

And here are some timings from elementary-data/elementary:

  • Create materialized views: median_execution_time = 22.7
  • Refresh materialized views (`--exclude tag:mv_auto_refresh`): median_execution_time = 21.2
  • Drop dev schema and try `+materialized: table` instead: median_execution_time = 8.4
> 2. Creating/refreshing materialized views does not run concurrently, unlike `materialized="table"` models.

Sorry, this is not true. When I check dbt.log, the investigative queries are indeed running in parallel. Each materialized view takes ~20s to complete, with a ~4s stagger between them, which made it look like they were running in series.

> 3. Using materialized views for near real-time reporting might not be possible yet, considering that `materialized_view` is a new feature.

I was hoping that chaining incremental materialized views together would handle slowly changing dimensions and would be faster than incremental models. It seems like the best use case for incremental materialized views is for source models with auto_refresh on. Speeding up the investigative queries would allow incremental materialized views to compete with the speed of incremental models, which would help my use case.


cmcnicoll commented on July 30, 2024

Yes, that sounds good to me! Thanks!


cmcnicoll commented on July 30, 2024

We are currently running hourly refreshes, but we are still interested in using incremental materialized views to do near real time refreshes. Thanks!

