Comments (11)

jtcohen6 commented on May 24, 2024

Hey @alepuccetti, this issue is most welcome! Here's the deal: Normally, in full-refresh mode, dbt can atomically replace a BigQuery table by running create or replace table. This should work the vast majority of the time. That statement will fail, however, if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table:

https://github.com/fishtown-analytics/dbt/blob/24e4b75c354e0f6bc5af7b51c4bb220428c84d5b/plugins/bigquery/dbt/include/bigquery/macros/materializations/incremental.sql#L105-L112
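
To make the failure concrete, here's a minimal BigQuery SQL sketch (dataset, table, and column names are made up; the exact error text may vary):

-- Succeeds when my_model already has this same partition spec:
create or replace table my_dataset.my_model
partition by date(created_at)
as select * from my_dataset.source_data;

-- Fails when my_model is currently partitioned differently: BigQuery
-- rejects the statement with an error along the lines of "Cannot
-- replace a table with a different partitioning spec". That is the
-- case dbt must hard-refresh.
create or replace table my_dataset.my_model
partition by date(updated_at)
as select * from my_dataset.source_data;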

What I'm hearing from you is: In the case where the table is not replaceable and must be hard-refreshed, dbt should try building the new table first in a temp location, then (if and only if the build succeeds) drop and swap. I completely agree! It's preferable for both of the reasons you mention, and this is how dbt replaces table models on databases without atomic create or replace DDL (e.g. Postgres, Redshift).

In fact, dbt-bigquery already has an adapter.rename_relation method that will take care of steps 3-4 (copy + delete temp version).
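
On adapters without atomic create or replace, that build-then-swap boils down to a sequence roughly like this (a sketch with hypothetical names; rename_relation covers the rename steps):

-- 1. Build the new table in a temp location; if this fails, the old
--    table is untouched.
create table my_schema.my_model__dbt_tmp as (
    select * from my_schema.source_data  -- model SQL goes here
);

-- 2. Move the old table aside, promote the new one, clean up.
alter table my_schema.my_model rename to my_model__dbt_backup;
alter table my_schema.my_model__dbt_tmp rename to my_model;
drop table my_schema.my_model__dbt_backup;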

So I think this would look like changing the code above:

{% elif full_refresh_mode %}
    {#-- If the partition/cluster config has changed, then we must hard refresh: --#}
    {#-- create new table in a temp location, drop, and swap --#}
    {% set must_hard_refresh = (not adapter.is_replaceable(existing_relation, partition_by, cluster_by)) %}
    {% if must_hard_refresh %}
        {% do log("Creating " ~ tmp_relation ~ " because " ~ existing_relation ~ " is not replaceable") %}
        {% set build_sql = create_table_as(False, tmp_relation, sql) %}
    {% else %}
        {% set build_sql = create_table_as(False, target_relation, sql) %}
    {% endif %}
{% else %}

Then further down, after calling the 'main' statement that includes build_sql but before running hooks, you would add:

{% if full_refresh_mode and must_hard_refresh %}
    {% do log("Hard refreshing " ~ existing_relation ~ " because it is not replaceable") %}
    {% do adapter.drop_relation(existing_relation) %}
    {% do adapter.rename_relation(tmp_relation, existing_relation) %}
{% endif %}

And I think that would just work!

It sounds like this is a code change you're interested in contributing. Let me know if you'd like any help getting started :)

jtcohen6 commented on May 24, 2024

My tests show that dbt always does the delete / re-create behaviour even when the partition/clustering does not change, but from your explanation it seems that should not be the case.

Indeed, that's not the desired behavior; it sounds like a bug with is_replaceable. Which version of dbt are you using? Any chance you could stick ipdb/breakpoint() into that method, try running full_refresh_test, and see what's up?
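
For reference, a sketch of what that debugging step could look like. The method shape here is approximated from dbt ~0.19's BigQuery adapter (plugins/bigquery/dbt/adapters/bigquery/impl.py), so treat the details as assumptions rather than the exact source:

# Approximate shape of BigQueryAdapter.is_replaceable (dbt ~0.19).
# Drop a breakpoint in before the comparison, then run:
#   dbt run --models full_refresh_test --full-refresh
def is_replaceable(self, relation, conf_partition, conf_cluster) -> bool:
    if not relation:
        return True
    table = self.connections.get_bq_table(
        database=relation.database,
        schema=relation.schema,
        identifier=relation.identifier,
    )
    breakpoint()  # temporary: inspect table.time_partitioning and
                  # table.clustering_fields vs. conf_partition / conf_cluster
    return all((
        self._partitions_match(table, conf_partition),
        self._clusters_match(table, conf_cluster),
    ))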

The proposal in this issue still has merit regardless (making hard refresh more seamless, with less downtime, is a net-good thing), but hard refreshes really should be a rare occurrence.

jtcohen6 commented on May 24, 2024

@alepuccetti Hm, I'm unable to reproduce this. Could you open a separate issue with all the information about your model config (both in config() and in dbt_project.yml), so we can get to the bottom of what's happening?

Ideally, we'd be able to resolve both issues:

  • Ensure that dbt is hard-refreshing models only when it strictly needs to (and figure out if/why it's doing it more often than it should)
  • When dbt must hard-refresh a model, make the process as seamless as possible (minimal downtime, better rollback on failure).

Let's keep this issue open for discussion of the latter.

balazs-mate commented on May 24, 2024

@gareginordyan: we had our fair share of problems with strange hard refreshes on non-incremental but partitioned and clustered tables. It all came down to the fact that we were using the cluster_by: "column_a,column_b" syntax instead of the documented cluster_by = ["column_a", "column_b"] syntax. When dbt checked whether the table was replaceable, the check always came back False, since ["column_a,column_b"] does not equal ["column_a", "column_b"].

def _clusters_match(self, table, conf_cluster) -> bool:
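
A minimal, runnable sketch of that check (the function body here approximates the adapter method; treat the details as assumptions):

# Sketch of the _clusters_match logic: a string config is wrapped in a
# one-element list, then compared to the table's clustering fields.
def clusters_match(table_clustering_fields, conf_cluster) -> bool:
    if isinstance(conf_cluster, str):
        conf_cluster = [conf_cluster]
    return table_clustering_fields == conf_cluster

existing = ["column_a", "column_b"]  # how the table is actually clustered

print(clusters_match(existing, ["column_a", "column_b"]))  # True: list syntax
print(clusters_match(existing, "column_a,column_b"))       # False: becomes the
                                                           # one-element list
                                                           # ["column_a,column_b"]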

bcolbert978 commented on May 24, 2024

@gareginordyan yes, tracked it down. And @balazs-mate, it was just as you describe, with models materialized as table (so non-incremental, and not full-refresh specific) and tables that were both partitioned and clustered (with a complex, four-column clustering scheme). We did not have the same syntax issues you describe, though.

Example of the config causing issues:

{{
  config(
    schema='transforms_bi',
    partition_by={'field': 'date', 'granularity': 'day'},
    cluster_by = ['sf_account_id', 'fivetran_account_id', '`group`', 'integration',],
    materialized='table'
  )
}}

We were able to resolve the downtime issue by removing partitioning and reducing our clustering scheme to two columns:

{{
  config(
    schema='transforms_bi',
    cluster_by = ['sf_account_id', 'date'],
    materialized='table'
  )
}}

which led us to believe that the downtime issue was related to the way dbt and/or BigQuery handles creating the partitioning. The raw downtime (as measured by the slot time consumed by the new version) was lower, but not significantly so. We now favor clustering over partitioning in nearly all cases and haven't seen these issues since.

This was all back in October 2021, running dbt version 0.19.1.

alepuccetti commented on May 24, 2024

That statement will fail, however, if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table:

Exactly, otherwise a simple create or replace would do. Also, this could play well with another issue I saw regarding applying the model changes after the tests pass, rather than running the tests after the model has been re-created (sorry, cannot find the issue now).

Thanks for the pointers for the implementation 😉.

However regarding this:

In the case where the table is not replaceable and must be hard-refreshed

My tests show that dbt always does the delete / re-create behaviour even when the partition/clustering does not change, but from your explanation it seems that should not be the case. My test model has `materialized='incremental'` and `incremental_strategy='insert_overwrite'`. And my debug log is:

2021-03-10 10:53:48.234858 (Thread-1): Began running node model.dbt_project.full_refresh_test
2021-03-10 10:53:48.237000 (Thread-1): 11:53:48 | 1 of 1 START incremental model alessandro__staging.full_refresh_test. [RUN]
2021-03-10 10:53:48.237549 (Thread-1): Acquiring new bigquery connection "model.dbt_project.full_refresh_test".
2021-03-10 10:53:48.237818 (Thread-1): Compiling model.project.full_refresh_test
2021-03-10 10:53:48.268434 (Thread-1): Writing injected SQL for node "model.dbt_project.full_refresh_test"
2021-03-10 10:53:48.361345 (Thread-1): Opening a new connection, currently in state closed
2021-03-10 10:53:48.637873 (Thread-1): Hard refreshing `project_id`.`alessandro__staging`.`full_refresh_test` because it is not replaceable
2021-03-10 10:53:48.865436 (Thread-1): Writing runtime SQL for node "model.dbt_project.full_refresh_test"

For some reason, this always happens. I have also tested with `materialized='table'`. @jtcohen6, am I missing something, or is this not the desired behaviour?
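
For reference, a config of the shape described would look roughly like this (the partition field and granularity are placeholders; only the two keys mentioned above are taken from the actual setup):

{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'granularity': 'day'}
  )
}}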

alepuccetti commented on May 24, 2024

Which version of dbt are you using?

I am using 0.19.0

gareginordyan commented on May 24, 2024

if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table

@jtcohen6 this was some time ago, but if I remember correctly, in our case our cluster/partition keys had not been changing and we were still having this downtime on a regular basis.

They were simply regular table models (non-incremental) that had both partition and cluster keys. I'll try to dig out the actual config.

cc: @bcolbert978

gareginordyan commented on May 24, 2024

@bcolbert978 I can't seem to find it in Slack, but if you by any chance have the model config handy, please paste it here.

github-actions commented on May 24, 2024

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

OmarArain commented on May 24, 2024

I'm seeing a similar issue. I am materializing a BigQuery table that is partitioned and clustered. I run `dbt run --select my_table` every 15 minutes, and downstream consumers report periodic downtime with errors like `Not found: Table my_table was not found in location US`.
