Hey @alepuccetti, this issue is most welcome! Here's the deal: normally, in full-refresh mode, dbt can atomically replace a BigQuery table by running `create or replace table`. This should work the vast majority of the time. That statement will fail, however, if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table.

What I'm hearing from you is: in the case where the table is not replaceable and must be hard-refreshed, dbt should try building the new table first in a temp location, then (if and only if the build succeeds) drop and swap. I completely agree! It's preferable for both of the reasons you mention, and this is how dbt replaces table models on databases without atomic `create or replace` DDL (e.g. Postgres, Redshift).
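To illustrate the decision dbt is making, here is a minimal Python sketch of the replaceability check, with the partition and cluster configs modeled as plain dicts and lists. This is a simplification for illustration only, not the actual `is_replaceable` implementation, which inspects live table metadata:

```python
# Simplified sketch of the replaceability decision. dbt's real check
# (is_replaceable in dbt/adapters/bigquery/impl.py) compares the model's
# config against the existing table's metadata; the dicts here are
# stand-ins for that.

def is_replaceable(existing: dict, model: dict) -> bool:
    """`create or replace table` is safe only if the partition and
    cluster configuration is unchanged; otherwise BigQuery rejects the
    statement and dbt must fall back to a hard refresh."""
    return (
        existing.get("partition_by") == model.get("partition_by")
        and existing.get("cluster_by") == model.get("cluster_by")
    )

existing = {
    "partition_by": {"field": "date", "granularity": "day"},
    "cluster_by": ["account_id"],
}
changed = {
    "partition_by": {"field": "date", "granularity": "day"},
    "cluster_by": ["account_id", "region"],  # clustering changed
}

print(is_replaceable(existing, existing))  # True  -> atomic replace
print(is_replaceable(existing, changed))   # False -> hard refresh
```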
In fact, dbt-bigquery already has an `adapter.rename_relation` method that will take care of steps 3-4 (copy + delete temp version).
So I think this would look like changing the code above:
```
{% elif full_refresh_mode %}
  {#-- If the partition/cluster config has changed, then we must hard refresh: --#}
  {#-- create new table in a temp location, drop, and swap --#}
  {% set must_hard_refresh = (not adapter.is_replaceable(existing_relation, partition_by, cluster_by)) %}
  {% if must_hard_refresh %}
    {% do log("Creating " ~ tmp_relation ~ " because " ~ existing_relation ~ " is not replaceable") %}
    {% set build_sql = create_table_as(False, tmp_relation, sql) %}
  {% else %}
    {% set build_sql = create_table_as(False, target_relation, sql) %}
  {% endif %}
{% else %}
```
Then further down, after calling the `main` statement that includes `build_sql`, but before running hooks, you would add:
```
{% if full_refresh_mode and must_hard_refresh %}
  {% do log("Hard refreshing " ~ existing_relation ~ " because it is not replaceable") %}
  {{ adapter.drop_relation(existing_relation) }}
  {{ adapter.rename_relation(tmp_relation, existing_relation) }}
{% endif %}
```
And I think that would just work!
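In plain Python, the proposed sequence looks something like the sketch below. The `adapter` object here is a stand-in used to show the ordering and failure behavior, not dbt's actual macro/adapter interface:

```python
# Sketch of the proposed hard-refresh sequence. The adapter argument is a
# placeholder; in dbt-bigquery the real steps are the create_table_as
# macro plus adapter.drop_relation and adapter.rename_relation.

def hard_refresh(adapter, target: str, tmp: str, select_sql: str) -> None:
    # 1. Build the new table in a temp location first. If this raises,
    #    nothing below runs and the existing table is left untouched --
    #    which is the whole point of the proposal.
    adapter.create_table_as(tmp, select_sql)
    # 2. Only after a successful build: drop the old table...
    adapter.drop_relation(target)
    # 3. ...and move the temp table into its place.
    adapter.rename_relation(tmp, target)
```

The key property is that a failed build leaves the existing table in place, and downtime shrinks to the brief window between the drop and the rename.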
It sounds like this is a code change you're interested in contributing. Let me know if you'd like any help getting started :)
from dbt-bigquery.
> My tests show that dbt always does the delete / re-create behaviour even when the partition/clustering does not change, but from your explanation it seems that should not be the case.
Indeed, that's not the desired behavior; it sounds like a bug with `is_replaceable`. Which version of dbt are you using? Any chance you could stick `ipdb`/`breakpoint()` into that method, try running `full_refresh_test`, and see what's up?
The proposal in this issue still has merit regardless: making hard refreshes more seamless, with less downtime, is a net-good thing. But they really should be a rare occurrence.
@alepuccetti Hm, I'm unable to reproduce this. Could you open a separate issue with all information about your model config (both in `config()` and in `dbt_project.yml`), so we can get to the bottom of what's happening?
Ideally, we'd be able to resolve both issues:
- Ensure that dbt is hard-refreshing models only when it strictly needs to (and figure out if/why it's doing it more often than it should)
- When dbt must hard-refresh a model, make the process as seamless as possible (minimal downtime, better rollback on failure).
Let's keep this issue open for discussion of the latter.
@gareginordyan: we had our fair share of problems with strange hard refreshes on non-incremental but partitioned and clustered tables. It all came down to the fact that we were using the `cluster_by: "column_a,column_b"` syntax instead of the documented `cluster_by = ["column_a", "column_b"]` syntax. When dbt checked whether the table was replaceable, it always came back with False, since `["column_a,column_b"]` does not equal `["column_a", "column_b"]`.
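A quick Python illustration of why the string form defeats the check (the coercion shown is a hypothetical mirror of how a single string ends up as a one-element list):

```python
# The undocumented comma-separated string form vs. the documented list form.
wrong = "column_a,column_b"
right = ["column_a", "column_b"]

# A single string gets wrapped into a one-element list, so the comparison
# against the table's actual clustering columns always fails:
wrong_as_list = [wrong] if isinstance(wrong, str) else wrong
print(wrong_as_list)           # ['column_a,column_b']
print(wrong_as_list == right)  # False -> table never looks replaceable

# Splitting on commas would recover the intended columns:
print([c.strip() for c in wrong.split(",")] == right)  # True
```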
`dbt-bigquery/dbt/adapters/bigquery/impl.py`, line 532 in `07bc144`
@gareginordyan yes, tracked it down, and @balazs-mate it was just as you describe: models materialized as `table` (so non-incremental and not full-refresh specific) that were both partitioned and clustered (with a complex, four-column clustering scheme). We did not have the same syntax issues you describe, though.
Example of the config causing issues:
```
{{
  config(
    schema='transforms_bi',
    partition_by={'field': 'date', 'granularity': 'day'},
    cluster_by = ['sf_account_id', 'fivetran_account_id', '`group`', 'integration',],
    materialized='table'
  )
}}
```
We were able to resolve the downtime issue by removing partitioning and dropping our clustering scheme to two columns:
```
{{
  config(
    schema='transforms_bi',
    cluster_by = ['sf_account_id', 'date'],
    materialized='table'
  )
}}
```
which led us to believe that the downtime issue was related to the way dbt and/or BigQuery handles creating the partitioning. The raw downtime (as measured by the slot time consumed by the new version) was lower, but not significantly lower. We now favor clustering over partitioning in nearly all cases and haven't seen these issues since.
This was all back in October 2021, running dbt version 0.19.1.
> That statement will fail, however, if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table.
Exactly, otherwise a simple `create or replace` would do. Also, this could play well with another issue I saw about applying the model changes after the tests pass, rather than running the tests after the model has been re-created (sorry, cannot find the issue now).
Thanks for the pointers for the implementation 😉.
However, regarding this:
> In the case where the table is not replaceable and must be hard-refreshed
My tests show that dbt always does the delete / re-create behaviour even when the partition/clustering does not change, but from your explanation it seems that should not be the case. My test model has `materialized='incremental'` and `incremental_strategy='insert_overwrite'`. And my debug log is:
```
2021-03-10 10:53:48.234858 (Thread-1): Began running node model.dbt_project.full_refresh_test
2021-03-10 10:53:48.237000 (Thread-1): 11:53:48 | 1 of 1 START incremental model alessandro__staging.full_refresh_test. [RUN]
2021-03-10 10:53:48.237549 (Thread-1): Acquiring new bigquery connection "model.dbt_project.full_refresh_test".
2021-03-10 10:53:48.237818 (Thread-1): Compiling model.project.full_refresh_test
2021-03-10 10:53:48.268434 (Thread-1): Writing injected SQL for node "model.dbt_project.full_refresh_test"
2021-03-10 10:53:48.361345 (Thread-1): Opening a new connection, currently in state closed
2021-03-10 10:53:48.637873 (Thread-1): Hard refreshing `project_id`.`alessandro__staging`.`full_refresh_test` because it is not replaceable
2021-03-10 10:53:48.865436 (Thread-1): Writing runtime SQL for node "model.dbt_project.full_refresh_test"
```
For some reason, this always happens. I have also tested with `materialized='table'`. @jtcohen6, am I missing something, or is this not the desired behaviour?
> Which version of dbt are you using?
I am using 0.19.0
> if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table
@jtcohen6 this was some time ago, but if I remember correctly, in our case our cluster/partition keys had not been changing and we were still having this downtime on a regular basis.
They were just simply regular materialized models (non-incremental) that had both partition and cluster keys. I'll try to dig out the actual config.
cc: @bcolbert978
@bcolbert978 I can't seem to find it in Slack, but if you by any chance have the model config handy, please paste it here.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
I'm seeing a similar issue. I am materializing a BigQuery table that is partitioned and clustered. I run `dbt run --select my_table` every 15 minutes, and the downstream consumers report periodic downtime with errors like `Not found: Table my_table was not found in location US`.