[ADAP-1077] [Drops Table In Separate Transaction] - Tables are dropped then the _tmp table renamed, causing other queries referencing the table to fail (dbt-redshift, open, 10 comments)

gtmehrer commented on July 30, 2024

from dbt-redshift.

Comments (10)

VersusFacit commented on July 30, 2024

@gtmehrer Hey there! Thanks for the submission. Just an upfront disclaimer that autocommit is True by default.

I recreated your logs essentially as you described.

My model:

{{
config(
    materialized='table'
)
}}

select 2

The Logs

You gave some log output, but I'm curious whether that's a red herring of sorts. My thinking: if you dig into core, you'll notice that a new log statement for DROP was added, yes! But that's just a log message; the underlying semantics of calling that drop statement haven't really changed. Maybe the order has changed. To verify that, we need more info 🕵️‍♀️

Maybe Mike is right that some special interaction of the new replace macros (added in this PR by the way) could be responsible, but I actually think it's a higher level configuration problem.

In dbt land, autocommit is on by default due to the needs of analytics engineers paired with the inertia of psycopg2's semantics. Mike alludes to this above.

A possible builtin solution to your issue

This causes queries referencing the refreshing model to fail if they run at the same time since the table doesn't exist.

As in you are seeing the model get dropped while other threads attempt to pull from it, whereas you want the rename and drop to happen in the same transaction?

Could you try with autocommit: False to see if that addresses your issue? That would prevent the always-commit-after-an-operation semantics you're running up against, I think!
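For anyone trying this, autocommit is set on the connection in profiles.yml. A minimal sketch follows; the profile name, host, and credentials are placeholders, and you should check the dbt-redshift docs for the options your version supports:

```yaml
# profiles.yml -- names and credentials below are placeholders
my_redshift_profile:
  target: dev
  outputs:
    dev:
      type: redshift
      host: my-cluster.example.us-east-1.redshift.amazonaws.com
      port: 5439
      dbname: my_db
      schema: my_schema
      user: my_user
      password: my_password
      threads: 4
      autocommit: false  # default is true
```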

Request for more info

Also, if you continue to have trouble, could you say more about your use case? It's not immediately clear from this description what your error is or how it's being triggered, but I'd love to know more. I want to be able to reproduce this. Without your use case in front of me, my best attempts to replicate failed regardless of whether I left autocommit on.


mikealfare commented on July 30, 2024

This is a regression that requires some design, hence the refinement tag. I believe the issue is that we moved from psycopg2 to redshift-connector. redshift-connector implements DBAPI 2.0 correctly, which assumes statements are executed one at a time (you need to build the transaction manually if you want one), whereas psycopg2 allows you to pass it multiple statements and automatically "wraps" them in a transaction. The assumption of the latter scenario exists in many places throughout dbt-redshift. This change was made in 1.5, so I'm a little surprised that this was working as a single transaction in 1.6. Perhaps it's the interaction of this scenario with the implementation of the replace set of macros that handles swapping tables with materialized views and vice versa.
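The one-statement-at-a-time DBAPI 2.0 behavior can be sketched with the standard library's sqlite3 driver, used here purely as a stand-in (this is not redshift-connector's or psycopg2's actual API):

```python
import sqlite3

# sqlite3 is a DBAPI 2.0 driver from the standard library, used only to
# illustrate the semantics described above.
conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit: each statement commits on its own

cur = conn.cursor()
cur.execute("CREATE TABLE my_model (id INTEGER)")  # the "old" table

# Under DBAPI 2.0 you execute one statement at a time; to make the swap
# atomic, the caller must open and close the transaction explicitly.
cur.execute("BEGIN")
cur.execute("CREATE TABLE my_model__dbt_tmp AS SELECT 1 AS id")
cur.execute("ALTER TABLE my_model RENAME TO my_model__dbt_backup")
cur.execute("ALTER TABLE my_model__dbt_tmp RENAME TO my_model")
cur.execute("COMMIT")  # other readers see the old table until this point

rows = cur.execute("SELECT id FROM my_model").fetchall()
print(rows)  # [(1,)]
```

Without the explicit BEGIN/COMMIT, each statement commits on its own, leaving a window where my_model does not exist, which is the failure mode described in this issue.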


martynydbt commented on July 30, 2024

@VersusFacit i added a backport label!


VersusFacit commented on July 30, 2024

There is an alternate possibility the team is exploring: it could be that we need to remove some BEGIN/COMMIT barriers in special edge cases like the one you may have found. I have a PR up for that just in case. We'd still very much appreciate you trying the autocommit adjustment as a sanity check and providing more insight into your use case 🖖


dbeatty10 commented on July 30, 2024

Mila and I looked at the BEGIN/COMMIT barriers in 1.6 vs. 1.7.

1.6

BEGIN
create table "ci"."dbt_dbeatty"."my_model__dbt_tmp" as (
    select 1 as id
);
alter table "ci"."dbt_dbeatty"."my_model" rename to "my_model__dbt_backup"
alter table "ci"."dbt_dbeatty"."my_model__dbt_tmp" rename to "my_model"
COMMIT
BEGIN
drop table if exists "ci"."dbt_dbeatty"."my_model__dbt_backup" cascade
COMMIT
BEGIN

1.7

BEGIN
create table "ci"."dbt_dbeatty"."my_model__dbt_tmp" as (
    select 1 as id
);
COMMIT
BEGIN
drop table if exists "ci"."dbt_dbeatty"."my_model" cascade
COMMIT
BEGIN
alter table "ci"."dbt_dbeatty"."my_model__dbt_tmp" rename to "my_model"
COMMIT
BEGIN
drop table if exists "ci"."dbt_dbeatty"."my_model__dbt_backup" cascade
COMMIT
BEGIN

Here's the diff showing two additional COMMIT/BEGIN pairs in 1.7 vs. 1.6:

12c12,16
< alter table "ci"."dbt_dbeatty"."my_model" rename to "my_model__dbt_backup"
---
> COMMIT
> BEGIN
> drop table if exists "ci"."dbt_dbeatty"."my_model" cascade
> COMMIT
> BEGIN

Here are the nitty-gritty details of how we got those commands (using jq):

Reprex

models/my_model.sql

{{ config(materialized="table") }}

select 1 as id

Install jq in macOS using Homebrew (or an alternative command for your environment):

brew install jq

Create virtual environments for dbt 1.6 and 1.7:

python3 -m venv dbt_1.6
source dbt_1.6/bin/activate
python -m pip install --upgrade pip
python -m pip install dbt-core~=1.6.0 dbt-postgres~=1.6.0 dbt-redshift~=1.6.0
dbt --version
deactivate

python3 -m venv dbt_1.7
source dbt_1.7/bin/activate
python -m pip install --upgrade pip
python -m pip install dbt-core~=1.7.0 dbt-postgres~=1.7.0 dbt-redshift~=1.7.0
dbt --version
deactivate

Use 1.6 to run the model and create JSON logs. Use jq to isolate the relevant log messages, then write them to their own file.

source dbt_1.6/bin/activate
rm -rf logs
dbt --log-format-file json run -s my_model
jq -r -c '.data | select(.node_info.resource_type == "model" and .node_info.node_name == "my_model") | select(.sql != null) | .sql' logs/dbt.log | while IFS= read -r line; do echo $line; done > dbt.log.my_model.1.6.sql
deactivate

Use 1.7 to run the model and create JSON logs. Use jq to isolate the relevant log messages, then write them to their own file.

source dbt_1.7/bin/activate
rm -rf logs
dbt --log-format-file json run -s my_model
jq -r -c '.data | select(.node_info.resource_type == "model" and .node_info.node_name == "my_model") | select(.sql != null) | .sql' logs/dbt.log | while IFS= read -r line; do echo $line; done > dbt.log.my_model.1.7.sql
deactivate

Look at the diff between the SQL that was executed in 1.6 vs. 1.7:

diff dbt.log.my_model.1.6.sql dbt.log.my_model.1.7.sql
12c12,16
< alter table "ci"."dbt_dbeatty"."my_model" rename to "my_model__dbt_backup"
---
> COMMIT
> BEGIN
> drop table if exists "ci"."dbt_dbeatty"."my_model" cascade
> COMMIT
> BEGIN


mikealfare commented on July 30, 2024

@VersusFacit and I talked through this quite a bit and we're pretty sure we found the issue. The root cause is that we are setting a class variable on BaseRelation in such a way that it does not get properly updated at the implementation layer.

Diagnosis

We set renameable_relations in dbt-core here:
https://github.com/dbt-labs/dbt-core/blob/0a6d0c158e5a1956bd48e53793ca3b28faee2b34/core/dbt/adapters/base/relation.py#L44

and attempt to override it with the correct configuration for dbt-redshift here:

renameable_relations = frozenset(

Unfortunately, because BaseRelation is a frozen dataclass, the override does not take effect. We didn't notice this in 1.6 because dbt-redshift uses the default table materialization in 1.6. However, we needed to use the renameable relations concept in 1.7 because of interactions between tables and materialized views. This required us to override the default table materialization to update this logic:
https://github.com/dbt-labs/dbt-core/blob/0a6d0c158e5a1956bd48e53793ca3b28faee2b34/core/dbt/include/global_project/macros/materializations/models/table.sql#L40

to include this conditional:

{% if existing_relation.can_be_renamed %}
    {{ adapter.rename_relation(existing_relation, backup_relation) }}
{% else %}
    {{ drop_relation_if_exists(existing_relation) }}
{% endif %}

The issue is that existing_relation.renameable_relations is always empty, hence existing_relation.can_be_renamed is always False. The correct logic path would pass the conditional, preserving the original behavior from dbt-core. Instead, we fall through to the else branch and drop the relation.
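The class-variable mechanics can be reproduced in a few lines of plain Python (the class names here are illustrative stand-ins, not dbt's actual classes). A plain class-attribute assignment in a frozen dataclass subclass does not redefine the field, so the generated __init__ still applies the parent's default:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BaseRelation:
    renameable_relations: tuple = ()  # default declared at the base layer

@dataclass(frozen=True)
class RedshiftLikeRelation(BaseRelation):
    # No type annotation, so the dataclass machinery does not treat this as a
    # field override; the generated __init__ still uses the inherited default,
    # and (frozen or not) the instance attribute it sets shadows this value.
    renameable_relations = ("table", "view")

relation = RedshiftLikeRelation()
print(relation.renameable_relations)  # () -- the override never takes effect
```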

Demonstrating Test

The following test in dbt-redshift demonstrates the issue:

from dbt.adapters.redshift.relation import RedshiftRelation
from dbt.contracts.relation import RelationType


def test_renameable_relation():
    relation = RedshiftRelation.create(
        database="my_db",
        schema="my_schema",
        identifier="my_table",
        type=RelationType.Table,
    )
    assert relation.renameable_relations == frozenset({
        RelationType.View,
        RelationType.Table
    })

This test currently fails because relation.renameable_relations == ().

Fix

We need to update this line:
https://github.com/dbt-labs/dbt-core/blob/0a6d0c158e5a1956bd48e53793ca3b28faee2b34/core/dbt/adapters/base/relation.py#L44
to use the field method like this:
https://github.com/dbt-labs/dbt-core/blob/0a6d0c158e5a1956bd48e53793ca3b28faee2b34/core/dbt/adapters/base/relation.py#L36
Since the default is a built-in (tuple), you should probably be able to avoid the lambda and do:

renameable_relations: SerializableIterable = field(default_factory=tuple)
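As a sketch of that suggestion, again with illustrative stand-in classes (and frozenset in place of the actual SerializableIterable type): using field at the base layer and re-declaring the field, with its annotation, in the subclass makes the override stick:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BaseRelation:
    # field() with a default_factory instead of a bare class-level default
    renameable_relations: frozenset = field(default_factory=frozenset)

@dataclass(frozen=True)
class RedshiftLikeRelation(BaseRelation):
    # Re-declared with an annotation, so this is a real field override
    renameable_relations: frozenset = field(
        default_factory=lambda: frozenset({"table", "view"})
    )

relations = RedshiftLikeRelation().renameable_relations
print(sorted(relations))  # ['table', 'view']
```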

While we're here, the same mistake was made in a related area on the next line of code:
https://github.com/dbt-labs/dbt-core/blob/0a6d0c158e5a1956bd48e53793ca3b28faee2b34/core/dbt/adapters/base/relation.py#L50
This should also use the field method to set the empty tuple as a default. The test should be updated (or a second one created) to check that this is configured correctly as well.

The fix needs to be implemented in dbt-core>=1.6 and dbt-adapters>=1.8. The tests need to be implemented in dbt-snowflake>=1.6, dbt-redshift>=1.6, and dbt-bigquery>=1.7.

Risks

We've been operating under the assumption that we've been using renameable_relations and replaceable_relations as expected. However, since the checks based on them always evaluated to False, we may be introducing new logic paths by fixing this bug. Put another way, resolving this bug effectively implements renameable_relations, replaceable_relations, and the associated replace macro in some places.


dbeatty10 commented on July 30, 2024

That is some great info @mikealfare and @VersusFacit 🧠

While we're in this area of code, here's one other thing that @VersusFacit and I noticed in #693 (comment):

  • in the last statement, my_model__dbt_backup is dropped if it exists.

It doesn't appear to ever get created, so it's a no-op. But it looks like it may be an unnecessary statement.

Maybe it will be naturally handled along with the other fixes?


VersusFacit commented on July 30, 2024

I don't think it will be, since it's an unconditional drop. I think it's just a leftover artifact. I can make a ticket to track dealing with this at some point :)


gtmehrer commented on July 30, 2024

Thanks so much for digging into this! @VersusFacit I did try with autocommit set to both True and False with the same results either way.

Our use case is that we refresh many models frequently throughout the day, around every 30 minutes. Outside processes querying dbt-managed tables will fail when they attempt to reference a table that has been dropped (because the drop is immediately committed by dbt). The table is generally only gone for a few seconds before it's recreated by another transaction, so it's a little difficult to replicate without high query volume.

Also, it just occurred to me that we are using Redshift's snapshot isolation, which may result in a different error than serializable isolation would. https://docs.aws.amazon.com/redshift/latest/dg/r_STV_DB_ISOLATION_LEVEL.html

Let me know if I can provide more info or help with testing!

