
Comments (2)

dataders commented on July 30, 2024

Hi @lucapazz, thanks so much for taking the time to document what you're seeing. That table is a work of art!

Within dbt Core and adapters, we haven't ever invested in performance testing or benchmarking. If I could wave a magic wand and introduce performance gating as part of our CI, I would in a heartbeat. This is certainly something that we could consider as a future investment.
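
To make the magic-wand idea concrete, a minimal performance gate in CI could look something like the sketch below. Everything here is hypothetical: the baseline file, the tolerance, and the workload (which in practice would be something like timing `dbt parse` or `dbt run` against a fixture project) are all stand-ins, not anything dbt-core actually ships.

```python
import json
import time
from pathlib import Path

BASELINE_FILE = Path("perf_baseline.json")  # hypothetical baseline location
TOLERANCE = 1.25  # fail the gate if a run is >25% slower than baseline

def timed(fn, *args, **kwargs):
    """Run fn once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start

def check_regression(name: str, duration: float) -> bool:
    """Compare a measured duration against the stored baseline.

    Returns True if the run is within tolerance. If no baseline exists
    yet for this workload, the duration is recorded as the new baseline.
    """
    baseline = {}
    if BASELINE_FILE.exists():
        baseline = json.loads(BASELINE_FILE.read_text())
    if name not in baseline:
        baseline[name] = duration
        BASELINE_FILE.write_text(json.dumps(baseline))
        return True
    return duration <= baseline[name] * TOLERANCE

# Example: gate on the duration of some workload (a stand-in for
# running dbt against a fixture project).
elapsed = timed(lambda: sum(i * i for i in range(100_000)))
print(check_regression("fixture_project_parse", elapsed))
```

A real gate would need repeated runs and noise handling (CI machines are noisy), but the shape is the same: measure, compare to a committed baseline, fail loudly on regression.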

Below I'll try to share as much context as I can, please tell me if this is helpful or where you disagree.

end user impact on performance

Jaffle Shop is exactly the project I'd reach for to do this benchmark. However, my gut instinct is that in real-world production dbt projects, the majority of the time is spent in the queries executed by the data warehouse itself. I shared this thought in recent comments on an r/dataengineering post about a project that is "dbt but 30X faster".

Before we get into some of the contributors to performance degradation, I'm curious to understand how you feel the "performance degradation" of upgrading from 1.3 to 1.7 would impact your team. Most importantly, how often do you or your team run the entire project? We've certainly seen teams that run into SLA issues when project build times take longer than an hour or so to run. And would the resulting "slowdown" outweigh the benefits of the new features and bug fixes in >=1.7? Or perhaps you're observing a performance "smell" that makes you hesitant to upgrade?

contributors to performance degradation

dbt-redshift depends on dbt-core which itself depends on many other packages. I can't speak with authority to the performance of dbt-core or other implicit dependencies, though maybe @peterallenwebb can chime in here.

I can, however, speak to the changes that have been made in dbt-redshift over the past year or so of minor versions.

migrating to redshift-connector from psycopg2

This is the largest contributor to the performance difference. Specifically, the driver's lack of support for the simple query cycle (aws/amazon-redshift-python-driver#162) requires every query to be executed in its own transaction.
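
To illustrate why one-transaction-per-statement hurts, here's a rough local sketch with sqlite3 standing in for the warehouse (the absolute overhead profile of a networked warehouse is very different, but the shape of the comparison — per-statement commits vs one enclosing transaction — is the same; all names here are illustrative):

```python
import sqlite3
import time

def run_statements(conn, n, per_statement_commit):
    """Execute n inserts, committing either after each one or once at the end."""
    cur = conn.cursor()
    start = time.perf_counter()
    for i in range(n):
        cur.execute("INSERT INTO t (v) VALUES (?)", (i,))
        if per_statement_commit:
            conn.commit()  # one transaction per statement
    if not per_statement_commit:
        conn.commit()  # single enclosing transaction
    return time.perf_counter() - start

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")

batched = run_statements(conn, 5_000, per_statement_commit=False)
per_stmt = run_statements(conn, 5_000, per_statement_commit=True)
print(f"single transaction: {batched:.4f}s, per-statement: {per_stmt:.4f}s")
```

Against a remote warehouse, each extra commit also pays a network round trip, which is where the per-statement cost really adds up.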

metadata views

The Redshift team has lately been shipping incredible new Datasharing cross-database and cross-cluster querying features. These are fundamentally not possible in either:

  • Postgres (from which Redshift was originally forked), or
  • Redshift dc2 node provisioned clusters

Datasharing requires database-spanning metadata that the catalog tables traditionally used within Redshift (pg_views, pg_tables, etc.) do not provide. dbt-redshift must now query the Redshift system views (svv_*) to get the same information. These metadata calls will not be as performant as the single-database calls were, but Redshift tells us that Datasharing and ra3 nodes are the future of the product. They're the experts on their product, so I'm inclined to agree with them.
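
For a concrete sense of the difference (illustrative only — the exact queries dbt-redshift issues vary by version): pg_tables only sees the database you're connected to, while a system view like svv_all_tables spans databases and so can serve the Datasharing case.

```python
# Illustrative metadata queries, NOT the exact SQL dbt-redshift runs.

# Postgres-inherited catalog: scoped to the current database only.
LEGACY_METADATA_QUERY = """
    SELECT schemaname, tablename
    FROM pg_tables
    WHERE schemaname = 'analytics'
"""

# Redshift system view: spans databases, as Datasharing/RA3 requires,
# at the cost of heavier metadata calls.
DATASHARING_METADATA_QUERY = """
    SELECT database_name, schema_name, table_name
    FROM svv_all_tables
    WHERE schema_name = 'analytics'
"""
```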

Should dbt-redshift choose to support both the old and the new metadata paths for Redshift?

With infinite engineering capacity, I'd have zero problem doing so. But alas, there is not, so we must make tradeoffs. We hope to work closely with the Redshift team to maintain the user experience.

What's next?

There's an argument to be made for dbt-redshift retaining support for both of the below scenarios:

  • "classic" single-database, single-cluster postgres-like scenario
  • forward-looking cross-database, cross-cluster scenarios?

In this world, users in the former scenario might get back the performance characteristics of dbt 1.3, but IMO the negatives hugely outweigh this benefit.

In this world, future changes to the Redshift adapter would have to be implemented in two different ways. This is not only more work, but it incurs the risk that it might be impossible to implement new functionality at parity, leading to weird divergences between the two ostensibly-identical adapters.

Additionally, the number of single-db/single-cluster Redshift users has surely peaked. People setting up new clusters today will be using the new experience, and over time more and more existing users will switch to RA3 nodes. (Perhaps even you!) So the number of users who would benefit from this additional work is continuously decreasing, making it a poor return on time invested.

Finally, this assumes that the svv_* tables will never get any more performant. Even though there are inherent penalties to cross-cluster queries vs the classic pg_* tables, the penalty isn't set in stone at a 50% drop.

To summarize, a better place to direct attention would be improving performance of the queries that dbt runs as part of a run. If redshift-connector exposes higher-performance ways for us to get what we need, we'd be enthusiastic to adopt them.

from dbt-redshift.

lucapazz commented on July 30, 2024

Hi @dataders, thank you so much for your reply.

Before answering your questions, I'd like to let you know that I took some time to investigate further.
I repeated my performance test on the jaffle_shop project using dbt-redshift==1.3.0 and then dbt-redshift==1.7.5, in order to compare the queries that the two versions of dbt ran on Redshift.

I attach a zip file containing the following files:

  • commands output - dbt 1.3.txt (it reports the output for each command invoked with dbt 1.3)
  • redshift queries - dbt 1.3.sql (it contains the SQL commands I used to query the Redshift system views in order to get the queries that were run by dbt 1.3)
  • redshift queries - dbt 1.3.tsv (it contains the results of the SQL commands in the previous file)
  • commands output - dbt 1.7.txt
  • redshift queries - dbt 1.7.sql
  • redshift queries - dbt 1.7.tsv

Some observations from those results:

  • Query execution times do not contribute to the total duration gap.
  • Given the dbt run command, the difference between the end time of the last query and the start time of the first query with dbt 1.7 is much bigger than with 1.3.
  • Given the dbt run command, times between consecutive queries to build a model (e.g. create tmp table -> rename final table to backup -> rename tmp table to final name -> drop backup table) are longer with dbt 1.7.
  • Given the dbt run command with dbt 1.7, queries to rename all final tables are executed only after all queries that create tmp tables have completed, even though "create table" and "alter table" queries for the same model are in the same transaction; the same thing happens between the two table renaming queries.

It seems as if the execution time increase is due to less efficient query "scheduling".
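
The gap analysis described above (time between the end of one query and the start of the next, which is idle from the warehouse's point of view) can be computed from exported query timings. A sketch, assuming (start, end) datetime pairs parsed from the .tsv exports — the values below are synthetic, not from the actual test:

```python
from datetime import datetime, timedelta

def inter_query_gaps(intervals):
    """Given (start, end) datetime pairs, return the idle gap before each
    subsequent query. Large totals here point at client-side (dbt/driver)
    overhead rather than query execution time."""
    ordered = sorted(intervals, key=lambda iv: iv[0])
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gaps.append(max(timedelta(0), next_start - prev_end))
    return gaps

# Synthetic example (real values would come from the exported .tsv files):
fmt = "%H:%M:%S.%f"
intervals = [
    (datetime.strptime("10:00:00.000000", fmt), datetime.strptime("10:00:00.200000", fmt)),
    (datetime.strptime("10:00:01.500000", fmt), datetime.strptime("10:00:01.600000", fmt)),
    (datetime.strptime("10:00:01.650000", fmt), datetime.strptime("10:00:01.900000", fmt)),
]
gaps = inter_query_gaps(intervals)
print([g.total_seconds() for g in gaps])  # → [1.3, 0.05]
```

Summing these gaps per dbt command and comparing 1.3 vs 1.7 would quantify exactly how much of the slowdown is spent outside the warehouse.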

Let's put the jaffle_shop project aside and move on to our use cases...

We populate our DWH using many dbt projects running in distinct Prefect flows. Each project contains a limited number of models and tests, so the execution times of an entire project are not so long.
However, in some cases, we have time constraints: for example, we have a Prefect flow that orchestrates many dbt projects and it has to be completed in an hour.
Unfortunately, at the moment, execution times are more important than the new features from dbt or Redshift.
Speaking of Redshift, we are using RA3 nodes.

If you are interested in having more details on our use cases, we are also available for a meeting to show you our projects :)

