dbt-glue's Introduction

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications. dbt is the T in ELT. Organize, cleanse, denormalize, filter, rename, and pre-aggregate the raw data in your warehouse so that it's ready for analysis.

dbt-glue

The dbt-glue package implements the dbt adapter protocol for AWS Glue's Spark engine. It supports running dbt against Spark, through the new Glue Interactive Sessions API.

To learn how to deploy a data pipeline in your modern data platform using the dbt-glue adapter, please read the following blog post: Build your data pipeline in your AWS modern data platform using AWS Lake Formation, AWS Glue, and dbt Core

Installation

The package can be installed from PyPI with:

$ pip3 install dbt-glue

For further (and more likely up-to-date) info, see the README

Connection Methods

Configuring your AWS profile for Glue Interactive Session

There are two IAM principals used with interactive sessions.

  • Client principal: The principal (either user or role) calling the AWS APIs (Glue, Lake Formation, Interactive Sessions) from the local client. This is the principal configured in the AWS CLI and is likely the same one you use elsewhere.
  • Service role: The IAM role that AWS Glue uses to execute your session. This is the same kind of role used by AWS Glue ETL jobs.

Read this documentation to configure these principals.

You will find below a least-privileged policy that enables all the features of the dbt-glue adapter.

Please update the variables between <>. The arguments are described below:

Args Description
region The region where your Glue database is stored
AWS Account The AWS account where you run your pipeline
dbt output database The database updated by dbt (this is the schema configured in the profile.yml of your dbt environment)
dbt source database All databases used as source
dbt output bucket The bucket where dbt will write data (the location configured in the profile.yml of your dbt environment)
dbt source bucket The bucket name of source databases (if they are not managed by Lake Formation)
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Read_and_write_databases",
            "Action": [
                "glue:SearchTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartitionIndex",
                "glue:DeleteDatabase",
                "glue:GetTableVersions",
                "glue:GetPartitions",
                "glue:DeleteTableVersion",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:DeletePartitionIndex",
                "glue:GetTableVersion",
                "glue:UpdateColumnStatisticsForTable",
                "glue:CreatePartition",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:GetTables",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetDatabase",
                "glue:GetPartition",
                "glue:UpdateColumnStatisticsForPartition",
                "glue:CreateDatabase",
                "glue:BatchDeleteTableVersion",
                "glue:BatchDeleteTable",
                "glue:DeletePartition",
                "glue:GetUserDefinedFunctions",
                "lakeformation:ListResources",
                "lakeformation:BatchGrantPermissions",
                "lakeformation:ListPermissions", 
                "lakeformation:GetDataAccess",
                "lakeformation:GrantPermissions",
                "lakeformation:RevokePermissions",
                "lakeformation:BatchRevokePermissions",
                "lakeformation:AddLFTagsToResource",
                "lakeformation:RemoveLFTagsFromResource",
                "lakeformation:GetResourceLFTags",
                "lakeformation:ListLFTags",
                "lakeformation:GetLFTag",
            ],
            "Resource": [
                "arn:aws:glue:<region>:<AWS Account>:catalog",
                "arn:aws:glue:<region>:<AWS Account>:table/<dbt output database>/*",
                "arn:aws:glue:<region>:<AWS Account>:database/<dbt output database>"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "Read_only_databases",
            "Action": [
                "glue:SearchTables",
                "glue:GetTableVersions",
                "glue:GetPartitions",
                "glue:GetTableVersion",
                "glue:GetTables",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetDatabase",
                "glue:GetPartition",
                "lakeformation:ListResources",
                "lakeformation:ListPermissions"
            ],
            "Resource": [
                "arn:aws:glue:<region>:<AWS Account>:table/<dbt source database>/*",
                "arn:aws:glue:<region>:<AWS Account>:database/<dbt source database>",
                "arn:aws:glue:<region>:<AWS Account>:database/default",
                "arn:aws:glue:<region>:<AWS Account>:database/global_temp"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "Storage_all_buckets",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<dbt output bucket>",
                "arn:aws:s3:::<dbt source bucket>"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "Read_and_write_buckets",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<dbt output bucket>"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "Read_only_buckets",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<dbt source bucket>"
            ],
            "Effect": "Allow"
        }
    ]
}

Configuration of the local environment

Because dbt and the dbt-glue adapter are compatible with Python versions 3.7, 3.8, and 3.9, first check your Python version:

$ python3 --version

Configure a Python virtual environment to isolate package versions and code dependencies:

$ python3 -m venv dbt_venv
$ source dbt_venv/bin/activate
$ python3 -m pip install --upgrade pip

Install the latest version of the AWS CLI:

$ curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
$ unzip awscliv2.zip
$ sudo ./aws/install

Install the boto3 package:

$ sudo yum install gcc krb5-devel.x86_64 python3-devel.x86_64 -y
$ pip3 install --upgrade boto3

Install the package:

$ pip3 install dbt-glue

Example config

type: glue
query-comment: This is a glue dbt example
role_arn: arn:aws:iam::1234567890:role/GlueInteractiveSessionRole
region: us-east-1
workers: 2
worker_type: G.1X
idle_timeout: 10
schema: "dbt_demo"
session_provisioning_timeout_in_seconds: 120
location: "s3://dbt_demo_bucket/dbt_demo_data"

The table below describes all the options.

Option Description Mandatory
project_name The dbt project name. This must be the same as the one configured in the dbt project. yes
type The driver to use. yes
query-comment A string to inject as a comment in each query that dbt runs. no
role_arn The ARN of the glue interactive session IAM role. yes
region The AWS Region where you run the data pipeline. yes
workers The number of workers of a defined workerType that are allocated when a job runs. yes
worker_type The type of predefined worker that is allocated when a job runs. Accepts a value of Standard, G.1X, or G.2X. yes
schema The schema used to organize data stored in Amazon S3. It is also the database in AWS Lake Formation that stores metadata tables in the Data Catalog. yes
session_provisioning_timeout_in_seconds The timeout in seconds for AWS Glue interactive session provisioning. yes
location The Amazon S3 location of your target data. yes
query_timeout_in_minutes The timeout in minutes for a single query. Default is 300. no
idle_timeout The AWS Glue session idle timeout in minutes. (The session stops after being idle for the specified amount of time) no
glue_version The version of AWS Glue for this session to use. Currently, the only valid options are 2.0, 3.0 and 4.0. The default value is 4.0. no
security_configuration The security configuration to use with this session. no
connections A comma-separated list of connections to use in the session. no
conf Specific configuration used at the startup of the Glue Interactive Session (arg --conf) no
extra_py_files Extra Python libraries that can be used by the interactive session. no
delta_athena_prefix A prefix used to create Athena compatible tables for Delta tables (if not specified, then no Athena compatible table will be created) no
tags The map of key value pairs (tags) belonging to the session. Ex: KeyName1=Value1,KeyName2=Value2 no
seed_format The seed file format. Defaults to parquet; any Spark-compatible format such as csv or json can be used. no
seed_mode The seed load mode. Defaults to overwrite (the seed data is overwritten); set it to append if you just want to add new data to your dataset. no
default_arguments The map of key-value pair parameters belonging to the session. More information on Job parameters used by AWS Glue. Ex: --enable-continuous-cloudwatch-log=true,--enable-continuous-log-filter=true no
glue_session_id Re-use a Glue session to run multiple dbt run commands. A new Glue session is created using glue_session_id if it does not exist yet. no
glue_session_reuse Re-use the Glue session to run multiple dbt run commands. If set to true, the Glue session will not be closed for re-use; if set to false, the session will be closed. In either case, the session closes once idle_timeout has expired. no
datalake_formats The ACID data lake format that you want to use if you are doing merge; can be hudi, iceberg, or delta. no
use_arrow (experimental) Use an Arrow file instead of stdout for better scalability. no

Configs

Configuring tables

When materializing a model as table, you may include several optional configs that are specific to the dbt-glue plugin, in addition to the standard model configs. An example combining several of these options follows the table below.

Option Description Required? Example
file_format The file format to use when creating tables (parquet, csv, json, text, jdbc or orc). Optional parquet
partition_by Partition the created table by the specified columns. A directory is created for each partition. Optional date_day
clustered_by Each partition in the created table will be split into a fixed number of buckets by the specified columns. Optional country_code
buckets The number of buckets to create while clustering Required if clustered_by is specified 8
custom_location By default, the adapter will store your data in the following path: location path/schema/table. If you don't want to follow that default behaviour, you can use this parameter to set your own custom location on S3 No s3://mycustombucket/mycustompath
hudi_options When using file_format hudi, gives the ability to overwrite any of the default configuration options. Optional {'hoodie.schema.on.read.enable': 'true'}
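For illustration, a model configuration combining several of these options could look like the sketch below; the model body, column names, and S3 path are placeholders, not values required by the adapter.

{{ config(
    materialized='table',
    file_format='parquet',
    partition_by=['date_day'],
    clustered_by=['country_code'],
    buckets=8,
    custom_location='s3://mycustombucket/mycustompath'
) }}

select * from {{ ref('events') }}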

Incremental models

dbt seeks to offer useful and intuitive modeling abstractions by means of its built-in configurations and materializations.

For that reason, the dbt-glue plugin leans heavily on the incremental_strategy config. This config tells the incremental materialization how to build models in runs beyond their first. It can be set to one of three values:

  • append: Insert new records without updating or overwriting any existing data.
  • insert_overwrite: If partition_by is specified, overwrite partitions in the table with new data. If no partition_by is specified, overwrite the entire table with new data.
  • merge (Apache Hudi and Apache Iceberg only): Match records based on a unique_key; update old records, insert new ones. (If no unique_key is specified, all new data is inserted, similar to append.)

Each of these strategies has its pros and cons, which we'll discuss below. As with any model config, incremental_strategy may be specified in dbt_project.yml or within a model file's config() block.

Note: the default strategy is insert_overwrite.
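For example, a folder of models can be switched to the append strategy in dbt_project.yml; the project and folder names below are placeholders.

models:
  my_dbt_project:
    events_models:
      +materialized: incremental
      +incremental_strategy: append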

The append strategy

Following the append strategy, dbt will perform an insert into statement with all new data. The appeal of this strategy is that it is straightforward and functional across all platforms, file types, connection methods, and Apache Spark versions. However, this strategy cannot update, overwrite, or delete existing data, so it is likely to insert duplicate records for many data sources.

Source code

{{ config(
    materialized='incremental',
    incremental_strategy='append',
) }}

--  All rows returned by this query will be appended to the existing table

select * from {{ ref('events') }}
{% if is_incremental() %}
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}

Run Code

create temporary view spark_incremental__dbt_tmp as

    select * from analytics.events

    where event_ts >= (select max(event_ts) from {{ this }})

;

insert into table analytics.spark_incremental
    select `date_day`, `users` from spark_incremental__dbt_tmp

The insert_overwrite strategy

This strategy is most effective when specified alongside a partition_by clause in your model config. dbt will run an atomic insert overwrite statement that dynamically replaces all partitions included in your query. Be sure to re-select all of the relevant data for a partition when using this incremental strategy.

If no partition_by is specified, then the insert_overwrite strategy will atomically replace all contents of the table, overriding all existing data with only the new records. The column schema of the table remains the same, however. This can be desirable in some limited circumstances, since it minimizes downtime while the table contents are overwritten. The operation is comparable to running truncate + insert on other databases. For atomic replacement of Delta-formatted tables, use the table materialization (which runs create or replace) instead.

Source Code

{{ config(
    materialized='incremental',
    partition_by=['date_day'],
    file_format='parquet'
) }}

/*
  Every partition returned by this query will be overwritten
  when this model runs
*/

with new_events as (

    select * from {{ ref('events') }}

    {% if is_incremental() %}
    where date_day >= date_add(current_date, -1)
    {% endif %}

)

select
    date_day,
    count(*) as users

from new_events
group by 1

Run Code

create temporary view spark_incremental__dbt_tmp as

    with new_events as (

        select * from analytics.events


        where date_day >= date_add(current_date, -1)


    )

    select
        date_day,
        count(*) as users

    from new_events
    group by 1

;

insert overwrite table analytics.spark_incremental
    partition (date_day)
    select `date_day`, `users` from spark_incremental__dbt_tmp

Specifying insert_overwrite as the incremental strategy is optional, since it's the default strategy used when none is specified.

The merge strategy

Compatibility:

  • Hudi: OK
  • Delta Lake: OK
  • Iceberg: OK
  • Lake Formation Governed Tables: Ongoing

NB:

  • For Glue 3, you have to set up a Glue connector.

  • For Glue 4, use the datalake_formats option in your profile.yml.

When using a connector, be sure that your IAM role has the following policy:

{
    "Sid": "access_to_connections",
    "Action": [
        "glue:GetConnection",
        "glue:GetConnections"
    ],
    "Resource": [
        "arn:aws:glue:<region>:<AWS Account>:catalog",
        "arn:aws:glue:<region>:<AWS Account>:connection/*"
    ],
    "Effect": "Allow"
}

and that the managed policy AmazonEC2ContainerRegistryReadOnly is attached. Be sure that you follow the getting started instructions here.

This blog post also explains how to set up and work with Glue connectors.

Hudi

Usage notes: The merge with Hudi incremental strategy requires:

  • To add file_format: hudi in your table configuration
  • To add datalake_formats to your profile: datalake_formats: hudi
    • Alternatively, to add a connections entry to your profile: connections: name_of_your_hudi_connector
  • To add the Kryo serializer to your Interactive Session config (in your profile): conf: spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false

dbt will run an atomic merge statement which looks nearly identical to the default merge behavior on Snowflake and BigQuery. If a unique_key is specified (recommended), dbt will update old records with values from new records that match on the key column. If a unique_key is not specified, dbt will forgo match criteria and simply insert all new records (similar to append strategy).

Profile config example

test_project:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: my comment
      role_arn: arn:aws:iam::1234567890:role/GlueInteractiveSessionRole
      region: eu-west-1
      glue_version: "4.0"
      workers: 2
      worker_type: G.1X
      schema: "dbt_test_project"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://aws-dbt-glue-datalake-1234567890-eu-west-1/"
      conf: spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false
      datalake_formats: hudi

Source Code example

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
    file_format='hudi',
    hudi_options={
        'hoodie.datasource.write.precombine.field': 'eventtime',
    }
) }}

with new_events as (

    select * from {{ ref('events') }}

    {% if is_incremental() %}
    where date_day >= date_add(current_date, -1)
    {% endif %}

)

select
    user_id,
    max(date_day) as last_seen

from new_events
group by 1

Delta

You can also use Delta Lake to take advantage of the merge feature on tables.

Usage notes: The merge with Delta incremental strategy requires:

  • To add file_format: delta in your table configuration
  • To add datalake_formats to your profile: datalake_formats: delta
    • Alternatively, to add a connections entry to your profile: connections: name_of_your_delta_connector
  • To add the following config to your Interactive Session config (in your profile): conf: "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

Athena: Athena is not compatible with Delta tables by default, but you can configure the adapter to create Athena-compatible tables on top of your Delta tables. To do so, configure the two following options in your profile:

  • delta_athena_prefix: "the_prefix_of_your_choice"
  • If your table is partitioned, new partitions are not added automatically; you need to run MSCK REPAIR TABLE your_delta_table after each new partition is added (see the sketch below)
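As a sketch, the statement to run in Athena after a new partition is added would look like the following; the table name is hypothetical, and the actual name of the Athena-compatible table depends on your delta_athena_prefix and model name.

-- Refresh partition metadata for the Athena-compatible table (table name is illustrative)
MSCK REPAIR TABLE delta_my_model;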

Profile config example

test_project:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: my comment
      role_arn: arn:aws:iam::1234567890:role/GlueInteractiveSessionRole
      region: eu-west-1
      glue_version: "4.0"
      workers: 2
      worker_type: G.1X
      schema: "dbt_test_project"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://aws-dbt-glue-datalake-1234567890-eu-west-1/"
      datalake_formats: delta
      conf: "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
      delta_athena_prefix: "delta"

Source Code example

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
    partition_by=['dt'],
    file_format='delta'
) }}

with new_events as (

    select * from {{ ref('events') }}

    {% if is_incremental() %}
    where date_day >= date_add(current_date, -1)
    {% endif %}

)

select
    user_id,
    max(date_day) as last_seen,
    current_date() as dt

from new_events
group by 1

Iceberg

Usage notes: The merge with Iceberg incremental strategy requires:

  • To add file_format: iceberg in your table configuration
  • To add datalake_formats to your profile: datalake_formats: iceberg
    • Alternatively, if you use Glue 3.0 or later, to add a connections entry to your profile: connections: name_of_your_iceberg_connector
      • For Athena version 3:
        • The adapter is compatible with the Iceberg Connector from AWS Marketplace with Glue 3.0 as the fulfillment option and 0.14.0 (Oct 11, 2022) as the software version.
        • The latest Iceberg connector in AWS Marketplace uses version 0.14.0 for Glue 3.0 and version 1.2.1 for Glue 4.0, where Kryo serialization fails when writing Iceberg; use "org.apache.spark.serializer.JavaSerializer" for spark.serializer instead (more info here).
      • For Athena version 2: The adapter is compatible with the Iceberg Connector from AWS Marketplace with Glue 3.0 as the fulfillment option and 0.12.0-2 (Feb 14, 2022) as the software version.
  • For Glue 4.0, to add the following configurations to your dbt profile:
    --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
    --conf spark.sql.catalog.glue_catalog.warehouse=s3://<PATH_TO_YOUR_WAREHOUSE>
    --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog 
    --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO 
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions  
  • For Glue 3.0, you need to set up more configurations:
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
    --conf spark.sql.warehouse=s3://<your-bucket-name>
    --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog 
    --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog 
    --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO 
    --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoDbLockManager
    --conf spark.sql.catalog.glue_catalog.lock.table=myGlueLockTable  
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  • Also note that for Glue 4.0, you can choose between Glue Optimistic Locking (enabled by default) and the DynamoDB Lock Manager for concurrent updates to a table.
    • If you want to activate the DynamoDB Lock Manager, set the config below in your profile. A DynamoDB table will be created on your behalf (if it does not already exist).
    --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager
    --conf spark.sql.catalog.glue_catalog.lock.table=<DYNAMODB_TABLE_NAME>
You'll also need to grant the dbt-glue execution role the appropriate permissions on DynamoDB:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CommitLockTable",
            "Effect": "Allow",
            "Action": [
                "dynamodb:CreateTable",
                "dynamodb:BatchGetItem",
                "dynamodb:BatchWriteItem",
                "dynamodb:ConditionCheckItem",
                "dynamodb:PutItem",
                "dynamodb:DescribeTable",
                "dynamodb:DeleteItem",
                "dynamodb:GetItem",
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:UpdateItem"
            ],
            "Resource": "arn:aws:dynamodb:<AWS_REGION>:<AWS_ACCOUNT_ID>:table/<DYNAMODB_TABLE_NAME>"
        }
    ]
}
  • Note that if you use Glue 3.0, the DynamoDB Lock Manager is the only option available and you need to set org.apache.iceberg.aws.glue.DynamoLockManager instead:
    --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoDbLockManager
    --conf spark.sql.catalog.glue_catalog.lock.table=myGlueLockTable  

dbt will run an atomic merge statement which looks nearly identical to the default merge behavior on Snowflake and BigQuery. You need to provide a unique_key to perform the merge operation, otherwise it will fail. This key must be provided as a Python list and can contain multiple column names to create a composite unique_key.
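For example, a composite key is declared as a list in the model config; the column names below are illustrative.

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='iceberg',
    unique_key=['user_id', 'date_day']
) }}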

Notes
  • When using a custom_location with Iceberg, avoid a final trailing slash. Adding a trailing slash leads to improper handling of the location and to issues when reading the data with query engines like Trino. The issue should be fixed for Iceberg versions > 0.13. The related GitHub issue can be found here.
  • Iceberg also supports insert_overwrite and append strategies.
  • The warehouse conf must be provided, but it's overwritten by the adapter location in your profile or custom_location in model configuration.
  • By default, this materialization has iceberg_expire_snapshots set to 'True'; if you need historical, auditable changes, set iceberg_expire_snapshots='False'.
  • Currently, due to dbt internals, the Iceberg catalog used when running Glue interactive sessions with dbt-glue has a hardcoded name, glue_catalog. This name is an alias pointing to the AWS Glue Catalog but is specific to each session. If you want to interact with your data in another session without using dbt-glue (from a Glue Studio notebook, for example), you can configure another alias (i.e. another name for the Iceberg catalog). To illustrate this concept, you can set the following in your configuration file:
--conf spark.sql.catalog.RandomCatalogName=org.apache.iceberg.spark.SparkCatalog

And then run in an AWS Glue Studio Notebook a session with the following config:

--conf spark.sql.catalog.AnotherRandomCatalogName=org.apache.iceberg.spark.SparkCatalog

In both cases, the underlying catalog would be the AWS Glue Catalog, unique in your AWS Account and Region, and you would be able to work with the exact same data. Also make sure that if you change the name of the Glue Catalog Alias, you change it in all the other --conf where it's used:

 --conf spark.sql.catalog.RandomCatalogName=org.apache.iceberg.spark.SparkCatalog 
 --conf spark.sql.catalog.RandomCatalogName.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog 
 ...
 --conf spark.sql.catalog.RandomCatalogName.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager
  • A full reference to table_properties can be found here.
  • Iceberg Tables are natively supported by Athena. Therefore, you can query tables created and operated with dbt-glue adapter from Athena.
  • Incremental materialization with the Iceberg file format supports dbt snapshots. You are able to run a dbt snapshot command that queries an Iceberg table and creates a dbt-fashioned snapshot of it.

Profile config example

test_project:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: my comment
      role_arn: arn:aws:iam::1234567890:role/GlueInteractiveSessionRole
      region: eu-west-1
      glue_version: "4.0"
      workers: 2
      worker_type: G.1X
      schema: "dbt_test_project"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://aws-dbt-glue-datalake-1234567890-eu-west-1/"
      datalake_formats: iceberg
      conf: --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.warehouse=s3://aws-dbt-glue-datalake-1234567890-eu-west-1/dbt_test_project --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager --conf spark.sql.catalog.glue_catalog.lock.table=myGlueLockTable  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions 

Source Code example

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key=['user_id'],
    file_format='iceberg',
    iceberg_expire_snapshots='False', 
    partition_by=['status'],
    table_properties={'write.target-file-size-bytes': '268435456'}
) }}

with new_events as (

    select * from {{ ref('events') }}

    {% if is_incremental() %}
    where date_day >= date_add(current_date, -1)
    {% endif %}

)

select
    user_id,
    max(date_day) as last_seen

from new_events
group by 1

Iceberg Snapshot source code example

{% snapshot demosnapshot %}

{{
    config(
        strategy='timestamp',
        target_schema='jaffle_db',
        updated_at='dt',
        file_format='iceberg'
) }}

select * from {{ ref('customers') }}

{% endsnapshot %}

Monitoring your Glue Interactive Session

Monitoring is an important part of maintaining the reliability, availability, and performance of AWS Glue and your other AWS solutions. AWS provides monitoring tools that you can use to watch AWS Glue, identify the number of workers required for your Glue Interactive Session, report when something is wrong, and take action automatically when appropriate. AWS Glue provides the Spark UI, as well as CloudWatch logs and metrics, for monitoring your AWS Glue jobs. More information on: Monitoring AWS Glue Spark jobs

Usage notes: Monitoring requires:

  • To add the following IAM policy to your IAM role:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudwatchMetrics",
            "Effect": "Allow",
            "Action": "cloudwatch:PutMetricData",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "cloudwatch:namespace": "Glue"
                }
            }
        },
        {
            "Sid": "CloudwatchLogs",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "logs:CreateLogStream",
                "logs:CreateLogGroup",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:/aws-glue/*",
                "arn:aws:s3:::bucket-to-write-sparkui-logs/*"
            ]
        }
    ]
}

Profile config example

test_project:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: my comment
      role_arn: arn:aws:iam::1234567890:role/GlueInteractiveSessionRole
      region: eu-west-1
      glue_version: "4.0"
      workers: 2
      worker_type: G.1X
      schema: "dbt_test_project"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://aws-dbt-glue-datalake-1234567890-eu-west-1/"
      default_arguments: "--enable-metrics=true, --enable-continuous-cloudwatch-log=true, --enable-continuous-log-filter=true, --enable-spark-ui=true, --spark-event-logs-path=s3://bucket-to-write-sparkui-logs/dbt/"

If you want to use the Spark UI, you can launch the Spark history server using an AWS CloudFormation template that hosts the server on an EC2 instance, or launch it locally using Docker. More information on Launching the Spark history server

Enabling AWS Glue Auto Scaling

Auto Scaling is available for AWS Glue version 3.0 and later. More information in the following AWS blog post: "Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark"

With Auto Scaling enabled, you will get the following benefits:

  • AWS Glue automatically adds and removes workers from the cluster depending on the parallelism at each stage or microbatch of the job run.

  • It removes the need for you to experiment and decide on the number of workers to assign for your AWS Glue Interactive sessions.

  • Once you choose the maximum number of workers, AWS Glue will choose the right size resources for the workload.

  • You can see how the size of the cluster changes during the Glue Interactive sessions run by looking at CloudWatch metrics. More information on Monitoring your Glue Interactive Session.

Usage notes: AWS Glue Auto Scaling requires:

  • To use AWS Glue version 3.0 or later.
  • To set the maximum number of workers (if Auto Scaling is enabled, the workers parameter sets the maximum number of workers)
  • To set the --enable-auto-scaling=true parameter on your Glue Interactive Session Config (in your profile). More information on Job parameters used by AWS Glue

Profile config example

test_project:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: my comment
      role_arn: arn:aws:iam::1234567890:role/GlueInteractiveSessionRole
      region: eu-west-1
      glue_version: "4.0"
      workers: 2
      worker_type: G.1X
      schema: "dbt_test_project"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://aws-dbt-glue-datalake-1234567890-eu-west-1/"
      default_arguments: "--enable-auto-scaling=true"

Access Glue catalog in another AWS account

In many cases, you may need your dbt jobs to read from another AWS account.

Review the following link https://repost.aws/knowledge-center/glue-tables-cross-accounts to set up access policies in the source and target accounts.

Add "spark.hadoop.hive.metastore.glue.catalogid=<TARGET-AWS-ACCOUNT-ID>" to the conf in your dbt profile; this way, you can have multiple outputs, one for each of the accounts that you have access to.

Note: Cross-account access must be within the same AWS Region.

Profile config example

test_project:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: my comment
      role_arn: arn:aws:iam::1234567890:role/GlueInteractiveSessionRole
      region: eu-west-1
      glue_version: "4.0"
      workers: 2
      worker_type: G.1X
      schema: "dbt_test_project"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://aws-dbt-glue-datalake-1234567890-eu-west-1/"
      conf: "--conf hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory 
             --conf spark.hadoop.hive.metastore.glue.catalogid=<TARGET-AWS-ACCOUNT-ID-B>"

Persisting model descriptions

Relation-level docs persistence is supported since dbt v0.17.0. For more information on configuring docs persistence, see the docs.

When the persist_docs option is configured appropriately, you'll be able to see model descriptions in the Comment field of describe [table] extended or show table extended in [database] like '*'.
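A minimal sketch of enabling relation-level docs persistence for all models in dbt_project.yml (the scope you apply it to is up to you):

models:
  +persist_docs:
    relation: true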

Always schema, never database

Apache Spark uses the terms "schema" and "database" interchangeably. dbt understands database to exist at a higher level than schema. As such, you should never use or set database as a node config or in the target profile when running dbt-glue.

If you want to control the schema/database in which dbt will materialize models, use the schema config and generate_schema_name macro only. For more information, check the dbt documentation about custom schemas.
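For instance, a group of models can be materialized in a custom schema from dbt_project.yml; the project and folder names below are placeholders, and note that standard dbt behaviour appends this value to the target schema unless you override the generate_schema_name macro.

models:
  my_dbt_project:
    marts:
      +schema: marketing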

AWS Lake Formation integration

The adapter supports AWS Lake Formation tags management, enabling you to associate tags defined outside of dbt-glue with database objects built by dbt-glue (databases, tables, views, snapshots, incremental models, seeds).

  • You can enable or disable lf-tags management via config, at model and dbt-project level (disabled by default)
  • If enabled, lf-tags will be updated on every dbt run. There are table-level lf-tags configs and column-level lf-tags configs.
  • You can specify that you want to drop existing database, table, and column Lake Formation tags by setting the drop_existing config field to True (False by default, meaning existing tags are kept)
  • Please note that if the tag you want to associate with the table does not exist, the dbt-glue execution will throw an error

The adapter also supports AWS Lake Formation data cell filtering.

  • You can enable or disable data-cell filtering via config, at model and dbt-project level (disabled by default)
  • If enabled, data_cell_filters will be updated on every dbt run.
  • You can specify that you want to drop existing table data-cell filters by setting the drop_existing config field to True (False by default, meaning existing filters are kept)
  • You can leverage the excluded_column_names or column_names config fields to perform column-level security as well. Please note that you can use one or the other, but not both.
  • By default, if you don't specify column_names or excluded_column_names, dbt-glue does not perform column-level filtering and lets the principal access all the columns.

The below configuration lets the specified principal (the lf-data-scientist IAM user) access rows that have a customer_lifetime_value > 15 and the specified columns ('customer_id', 'first_order', 'most_recent_order', 'number_of_orders').

lf_grants={
        'data_cell_filters': {
            'enabled': True,
            'drop_existing' : True,
            'filters': {
                'the_name_of_my_filter': {
                    'row_filter': 'customer_lifetime_value>15',
                    'principals': ['arn:aws:iam::123456789:user/lf-data-scientist'], 
                    'column_names': ['customer_id', 'first_order', 'most_recent_order', 'number_of_orders']
                }
            }, 
        }
    }

The below configuration lets the specified principal (the lf-data-scientist IAM user) access rows that have a customer_lifetime_value > 15 and all the columns except the one specified ('first_name').

lf_grants={
        'data_cell_filters': {
            'enabled': True,
            'drop_existing' : True,
            'filters': {
                'the_name_of_my_filter': {
                    'row_filter': 'customer_lifetime_value>15',
                    'principals': ['arn:aws:iam::123456789:user/lf-data-scientist'], 
                    'excluded_column_names': ['first_name']
                }
            }, 
        }
    }

Below are some examples of how you can integrate LF tags management and data cell filtering into your configurations:

At model level

This way of defining your Lake Formation rules is appropriate if you want to handle the tagging and filtering policy at object level. Remember that it overrides any configuration defined at dbt-project level.

{{ config(
    materialized='incremental',
    unique_key="customer_id",
    incremental_strategy='append',
    lf_tags_config={
          'enabled': true,
          'drop_existing' : False,
          'tags_database': 
          {
            'name_of_my_db_tag': 'value_of_my_db_tag'          
            }, 
          'tags_table': 
          {
            'name_of_my_table_tag': 'value_of_my_table_tag'          
            }, 
          'tags_columns': {
            'name_of_my_lf_tag': {
              'value_of_my_tag': ['customer_id', 'customer_lifetime_value', 'dt']
            }}},
    lf_grants={
        'data_cell_filters': {
            'enabled': True,
            'drop_existing' : True,
            'filters': {
                'the_name_of_my_filter': {
                    'row_filter': 'customer_lifetime_value>15',
                    'principals': ['arn:aws:iam::123456789:user/lf-data-scientist'], 
                    'excluded_column_names': ['first_name']
                }
            }, 
        }
    }
) }}

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value,
        current_date() as dt
        
    from customers

    left join customer_orders using (customer_id)

    left join customer_payments using (customer_id)

At dbt-project level

This way you can specify tags and data filtering policies for a particular path in your dbt project (e.g. models, seeds, models/model_group1, etc.). This is especially useful for seeds, for which you can't define configuration in the file directly.

seeds:
  +lf_tags_config:
    enabled: true
    tags_table: 
      name_of_my_table_tag: 'value_of_my_table_tag'  
    tags_database: 
      name_of_my_database_tag: 'value_of_my_database_tag'
models:
  +lf_tags_config:
    enabled: true
    drop_existing: True
    tags_database: 
      name_of_my_database_tag: 'value_of_my_database_tag'
    tags_table: 
      name_of_my_table_tag: 'value_of_my_table_tag'

Tests

To perform a functional test:

  1. Install dev requirements:
$ pip3 install -r dev-requirements.txt
  2. Install dev locally
$ python3 setup.py build && python3 setup.py install_lib
  3. Export variables
$ export DBT_AWS_ACCOUNT=123456789101
$ export DBT_GLUE_REGION=us-east-1
$ export DBT_S3_LOCATION=s3://mybucket/myprefix
$ export DBT_GLUE_ROLE_ARN=arn:aws:iam::1234567890:role/GlueInteractiveSessionRole

Caution: Be careful not to set an S3 path containing important files. dbt-glue's test suite automatically deletes all existing files under the S3 path specified in DBT_S3_LOCATION.

  4. Run the test
$ python3 -m pytest tests/functional

or

$ python3 -m pytest -s 

For more information, check the dbt documentation about testing a new adapter.

Caveats

Supported Functionality

Most dbt Core functionality is supported, but some features are only available with Apache Hudi.

Apache Hudi-only features:

  1. Incremental model updates by unique_key instead of partition_by (see merge strategy)

Some dbt features, available on the core adapters, are not yet supported on Glue:

  1. Persisting column-level descriptions as database comments
  2. Snapshots

For more information on dbt:


Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

dbt-glue's People

Contributors

aajisaka, adithyapathipaka, amazon-auto, amineizanami, andrusha, armaseg, brianhtn, cch0, danphenderson, dataders, dependabot[bot], hanna-liashchuk, jduggin-parchment, lawofcycles, marcnegele, mehdimld, menuetb, mferryrv, mmehrten, moomindani, moshir, nandubatchu, nicor88, pixie79, sanga8, satyam-k-m, saulgausin, stufan, yaroslav-ost, zhangyuan


dbt-glue's Issues

Glue adapter: Error in GlueCursor execute An error occurred (ThrottlingException) when calling the GetStatement operation

Describe the bug
We are running dbtvault jobs on Glue and have scheduled them via Airflow to run daily. We run all the models through commands like "dbt run -m models/path/model.sql", and each model runs in its own pod, so a pod is created for every dbtvault model. The whole process is driven by a for loop implemented in the Airflow DAG, so all the dbtvault models run simultaneously. Since the models run separately, each one creates its own Glue interactive session in its own pod.
What is happening is that we are experiencing a throttling exception. On debugging, we found it happens because we reached the maximum limit of API calls for interactive sessions at a time. We increased the limit, but there always seems to be a random model that fails with the throttling exception. The weird thing is that even though it fails with this throttling exception, dbt still reports the status as "ok completed". The problem is not tied to a specific model; every time the DAG is run, a different model fails with this issue.

Steps To Reproduce

  1. Connect dbtvault with glue
  2. Run dbtvault models simultaneously in a way that all of them create different glue interactive sessions
  3. Glue version - 3.0
    DBT version - 1.1.2
    Spark - 1.1.0

Screenshots and log output

[2022-12-02, 19:30:13 IST] {kubernetes_pod.py:587} INFO - Creating pod run-dbt-cmd-ref-clm-cs-a4ec0e9599314d35ade1019a995bbcf0 with labels: {'dag_id': 'care_data_model_airflow_dag', 'task_id': 'run_dbt_cmd_ref_clm_cs', 'run_id': 'manual__2022-12-02T135414.2097260000-21e70563e', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2022-12-02, 19:30:13 IST] {kubernetes_pod.py:380} INFO - Found matching pod run-dbt-cmd-ref-clm-cs-a4ec0e9599314d35ade1019a995bbcf0 with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': '2.4.2', 'dag_id': 'care_data_model_airflow_dag', 'foo': 'bar', 'kubernetes_pod_operator': 'True', 'run_id': 'manual__2022-12-02T135414.2097260000-21e70563e', 'task_id': 'run_dbt_cmd_ref_clm_cs', 'try_number': '1'}
[2022-12-02, 19:30:13 IST] {kubernetes_pod.py:381} INFO - `try_number` of task_instance: 1
[2022-12-02, 19:30:13 IST] {kubernetes_pod.py:382} INFO - `try_number` of pod: 1
[2022-12-02, 19:30:13 IST] {pod_manager.py:180} WARNING - Pod not yet started: run-dbt-cmd-ref-clm-cs-a4ec0e9599314d35ade1019a995bbcf0
[2022-12-02, 19:30:14 IST] {pod_manager.py:180} WARNING - Pod not yet started: run-dbt-cmd-ref-clm-cs-a4ec0e9599314d35ade1019a995bbcf0
[2022-12-02, 19:30:15 IST] {pod_manager.py:228} INFO - + cd /apps/dbt_vault
[2022-12-02, 19:30:15 IST] {pod_manager.py:228} INFO - + dbt run -m models/dbtvault_models/reference/ref_clm_cs/dbtv/ref_clm_cs.sql --vars '{"start_date":"1640995200000","end_date":"1669989282999"}'
[2022-12-02, 19:30:16 IST] {pod_manager.py:228} INFO - 14:00:16  Running with dbt=1.1.2
[2022-12-02, 19:30:16 IST] {pod_manager.py:228} INFO - 14:00:16  Unable to do partial parsing because config vars, config profile, or config target have changed
[2022-12-02, 19:30:16 IST] {pod_manager.py:228} INFO - 14:00:16  Unable to do partial parsing because profile has changed
[2022-12-02, 19:30:26 IST] {pod_manager.py:228} INFO - 14:00:26  Found 158 models, 4 tests, 0 snapshots, 0 analyses, 716 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
[2022-12-02, 19:30:26 IST] {pod_manager.py:228} INFO - 14:00:26
[2022-12-02, 19:30:59 IST] {pod_manager.py:228} INFO - 14:00:59  Concurrency: 1 threads (target='prod')
[2022-12-02, 19:30:59 IST] {pod_manager.py:228} INFO - 14:00:59
[2022-12-02, 19:30:59 IST] {pod_manager.py:228} INFO - 14:00:59  1 of 1 START incremental model empower_uat_raw.ref_clm_cs ...................... [RUN]
[2022-12-02, 19:31:13 IST] {pod_manager.py:228} INFO - 14:01:13  Glue adapter: Error in GlueCursor execute An error occurred (ThrottlingException) when calling the GetStatement operation (reached max retries: 4): Rate exceeded
[2022-12-02, 19:31:13 IST] {pod_manager.py:228} INFO - 14:01:13  Glue adapter: __init__() missing 3 required positional arguments: 'cwd', 'cmd', and 'message'
[2022-12-02, 19:31:14 IST] {pod_manager.py:228} INFO - 14:01:14  1 of 1 OK created incremental model empower_uat_raw.ref_clm_cs ................. [�[32mOK�[0m in 15.86s]
[2022-12-02, 19:31:15 IST] {pod_manager.py:228} INFO - 14:01:15
[2022-12-02, 19:31:15 IST] {pod_manager.py:228} INFO - 14:01:15  Finished running 1 incremental model in 48.33s.
[2022-12-02, 19:31:15 IST] {pod_manager.py:228} INFO - 14:01:15
[2022-12-02, 19:31:15 IST] {pod_manager.py:228} INFO - 14:01:15  �[32mCompleted successfully�[0m
[2022-12-02, 19:31:15 IST] {pod_manager.py:228} INFO - 14:01:15
[2022-12-02, 19:31:15 IST] {pod_manager.py:228} INFO - 14:01:15  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
[2022-12-02, 19:31:25 IST] {pod_manager.py:228} INFO - 14:01:25  Error sending message, disabling tracking
[2022-12-02, 19:31:26 IST] {pod_manager.py:275} INFO - Pod run-dbt-cmd-ref-clm-cs-a4ec0e9599314d35ade1019a995bbcf0 has phase Running
[2022-12-02, 19:31:28 IST] {kubernetes_pod.py:475} INFO - Deleting pod: run-dbt-cmd-ref-clm-cs-a4ec0e9599314d35ade1019a995bbcf0
[2022-12-02, 19:31:28 IST] {taskinstance.py:1406} INFO - Marking task as SUCCESS. dag_id=care_data_model_airflow_dag, task_id=run_dbt_cmd_ref_clm_cs, execution_date=20221202T135414, start_date=20221202T140011, end_date=20221202T140128
[2022-12-02, 19:31:28 IST] {local_task_job.py:164} INFO - Task exited with return code 0
[2022-12-02, 19:31:28 IST] {local_task_job.py:273} INFO - 1 downstream tasks scheduled from follow-on schedule check

Error using HUDI with dbt-glue: `HoodieException: 'hoodie.table.name' must be set`

Describe the bug

Running dbt run on a simple dbt model with dbt-glue gives HoodieException: 'hoodie.table.name' must be set when trying to use Apache HUDI.

Steps To Reproduce

HUDI installed via JAR and a custom connector in AWS Glue (code is running in GovCloud where AWS Marketplace extensions are not available).

  • HUDI Jar
  • Class name: org.apache.hudi
  • Connection name: hudi_connection

Profiles.yml:

govcloud_demo:
  outputs:
    dev:
      type: glue
      query-comment: Glue DBT
      role_arn: role
      region: us-gov-west-1
      glue_version: "3.0"
      workers: 2
      worker_type: G.1X
      idle_timeout: 10
      schema: "analytics"
      database: "analytics"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://data/path/"
      connections: hudi_connection
      conf: "spark.serializer=org.apache.spark.serializer.KryoSerializer"
      default_arguments: "--enable-metrics=true, --enable-continuous-cloudwatch-log=true, --enable-continuous-log-filter=true, --enable-spark-ui=true, --spark-event-logs-path=s3://logs/path/"
  target: dev

dbt_project.yml:

name: 'govcloud_demo'
version: '1.0.0'
config-version: 2
profile: 'govcloud_demo'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"
models:
  +file_format: hudi
  govcloud_demo:
    example:
      +materialized: view

Model.sql: (note: Have used file_format=hudi here - same behavior occurs whether configured in dbt_project.yml or in the model file).

{{ config(materialized='table') }}
with source_data as (
    select 1 as id,
    "b" AS anothercol
)
select *
from source_data

Expected behavior

Model runs and creates in Glue catalog / S3.

Screenshots and log output

22:45:01      '''), Py4JJavaError: An error occurred while calling o86.sql.
22:45:01    : org.apache.hudi.exception.HoodieException: 'hoodie.table.name' must be set.
22:45:01        at org.apache.hudi.common.config.HoodieConfig.getStringOrThrow(HoodieConfig.java:237)
22:45:01        at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:98)
22:45:01        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:144)
22:45:01        at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:530)
...

System information

The output of dbt --version:

Core:
  - installed: 1.3.1
  - latest:    1.3.1 - Up to date!

Plugins:
  - spark: 1.3.0 - Up to date!

The operating system you're using: MacOS

The output of python --version: Python 3.10.8

Threading Docs Generation - backoff

Describe the bug

We have a reasonable number of models and source tables, around 40 and 30 respectively. Generating the docs takes a long time. During testing of this I tried to increase the threads to 20, considering that most of these commands are individual boto describe calls against Glue rather than Glue sessions.

What happened was the following error:

Glue adapter: Error in GlueCursor execute An error occurred (ThrottlingException) when calling the RunStatement operation (reached max retries: 4): Rate exceeded

I received the same error even with 4 threads.

Expected behavior

I would have expected the adapter not to overrun the API limit in the first place, but if it did, to back off and retry appropriately rather than fail.

System information

The output of dbt --version:

 poetry run dbt --version
Core:
  - installed: 1.3.1
  - latest:    1.3.1 - Up to date!

Plugins:
  - redshift: 1.3.0 - Up to date!
  - postgres: 1.3.1 - Up to date!
  - spark:    1.3.0 - Up to date!

The operating system you're using: AWS ECS Fargate Docker container

The output of python --version: Python 3.10.7

Additional context

Add any other context about the problem here.

Queries that finishes with single quote fails

Describe the bug

If a query ends with a single quote ('), the execution of the query will fail.

Steps To Reproduce

Execute a query that ends with a single quote ('), e.g. WHERE column='foo'

Expected behavior

The query is executed successfully.

Screenshots and log output

Glue cursor returned `error` for statement None for code SqlWrapper2.execute('''create table dbt_nyc_metrics.gold_nyctaxi_cost_metrics
      using PARQUET
      
      LOCATION 's3://bucket/dbt_nyc_metrics/gold_nyctaxi_cost_metrics/'
      
      as
        SELECT (avg_total_amount/avg_trip_distance) as avg_cost_per_distance
  , (avg_total_amount/avg_duration) as avg_cost_per_minute
  , year
  , month 
  , type
  FROM dbt_nyc_metrics.silver_nyctaxi_avg_metrics
  WHERE type = 'yellow''''), SyntaxError: EOL while scanning string literal (<stdin>, line 14)
  compiled SQL at target/run/dbtgluenyctaxidemo/models/gold_metrics/gold_nyctaxi_cost_metrics.sql
21:23:36.132301 [debug] [Thread-1  ]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '8f256c47-d137-4a73-9765-61ad270e3900', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f2918b7df10>]}
21:23:36.133364 [error] [Thread-1  ]: 1 of 3 ERROR creating table model dbt_nyc_metrics.gold_nyctaxi_cost_metrics .... [ERROR in 21.39s]

System information

The output of dbt --version:

dbt --version
Core:
  - installed: 1.2.1
  - latest:    1.2.2 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - spark: 1.2.0 - Up to date!

The operating system you're using:
MacOS Monterey
The output of python --version:
Python 3.7.10

Schema character

Hello All,
We are working on the POC to integrate dbt with our data lake on AWS with glue adapter and trying to implement a model that can load data to the s3 location with the below table config where the database name and DB s3 location name are not the same,
table_name: dbt_table
Database: det_landing_data_platform
S3 Database location: s3://det-landing-data-lake/data-platform
S3 table Data Location: s3://det-landing-data-lake/data-platform/dbt_table

Profiles:

aws_demo_proj:
  target: aws_dev
  outputs:
    aws_dev:
      type: glue
      query-comment: This is a glue dbt example
      role_arn: arn:aws:iam::xxxxxxxxxxx:role/dbt-demo-role
      region: eu-central-1
      workers: 5
      worker_type: G.1X
      schema: "data-platform"
      database: "det_landing_data_platform"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://det-landing-data-lake"

When we run dbt, the execution fails with the error "ParseException: \nmismatched input '-' expecting" in the schema name. As I understand it, the database for Glue in dbt is considered a schema name in the dbt profile, and the database variable is not used anywhere in the model.
But when we change the profile configuration as below,
Profiles:

aws_demo_proj:
  target: aws_dev
  outputs:
    aws_dev:
      type: glue
      query-comment: This is a glue dbt example
      role_arn: arn:aws:iam::xxxxxxxxxxx:role/dbt-demo-role
      region: eu-central-1
      workers: 5
      worker_type: G.1X
      schema: "det_landing_data_platform"
      database: "det_landing_data_platform"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://det-landing-data-lake/data-platform"
It creates the table path as "s3://det-landing-data-lake/data-platform/det_landing_data_platform/dbt_table", which is not what we want, as it appends the schema name to the location, and in our scenario the DB's location name is different from the DB name.
However, when we change the schema name from "data-platform" to "data_platform", it works.
Can anyone please help us out with how we can set the profile configuration so that the data is stored at the right path, "s3://det-landing-data-lake/data-platform/<table_name>"?

Pagination Bug

Bug

The adapter code isn't iterating while a NextToken is returned in the boto3 Glue client response. This will surely blow up at scale.

Example

Consider the function defined in dbt-glue/blob/main/dbt/adapters/glue/impl.py:

    def list_schemas(self, database: str) -> List[str]:
        session, client, cursor = self.get_connection()
        responseGetDatabases = client.get_databases()
        databaseList = responseGetDatabases['DatabaseList']
        schemas = []
        for databaseDict in databaseList:
            databaseName = databaseDict['Name']
            schemas.append(databaseName)
        return schemas

The definition above should check whether it needs to make a continuation call, i.e. keep calling get_databases while the response contains a NextToken.

Reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_databases

Solution

In the example above, databaseList needs to be built up across all pages. The same bug appears in other locations as well.
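
A minimal sketch of what paginated listing could look like (illustrative only, not the adapter's actual code; it assumes a boto3 Glue client is passed in):

import boto3

def list_all_databases(client) -> list:
    """Collect database names across every page of the Glue get_databases API."""
    schemas = []
    next_token = None
    while True:
        # Only pass NextToken when we actually have one from a previous page.
        kwargs = {"NextToken": next_token} if next_token else {}
        response = client.get_databases(**kwargs)
        schemas.extend(db["Name"] for db in response["DatabaseList"])
        next_token = response.get("NextToken")
        if not next_token:
            break
    return schemas

# Usage: schemas = list_all_databases(boto3.client("glue"))

boto3 also exposes a paginator for get_databases, which would achieve the same thing with less code.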

Not able to query dbt-glue Iceberg Table with Athena Engine version 3

Describe the bug

Currently not able to query Iceberg tables created by dbt-glue adapter with Athena engine version 3

Steps To Reproduce

  • Create an Iceberg table using dbt-glue
  • Try to query this table from Athena using a version 3 Athena Query Engine

Screenshots and log output

Screenshot taken 2023-01-23 at 17:00:26

Additional context

This error can be related to the Iceberg Connector version (the only supported version is currently the 0.12.0-2 software version)

Exceptions updating seeds are swallowed

Describe the bug

Hi folks, thanks for the hard work on the glue adapter.

I noticed that exceptions are swallowed when updating seeds because of a blanket bare except.

https://github.com/aws-samples/dbt-glue/blob/3dc41e27e92c83d75c606facc5836cbe7f488047/dbt/adapters/glue/impl.py#L407:L419
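
For illustration only (this is not the adapter's actual code, and update_seed_table / recreate_seed_table are made-up stand-ins), the difference is roughly:

import logging

logger = logging.getLogger(__name__)

# Current pattern: a bare except hides every failure, including programming
# errors, and silently falls back to re-creating the table.
try:
    update_seed_table()
except:
    recreate_seed_table()

# Safer sketch: catch an explicit Exception, log it, and re-raise so dbt
# reports the real cause instead of masking it.
try:
    update_seed_table()
except Exception as exc:
    logger.error("Failed to update seed table: %s", exc)
    raise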

Steps To Reproduce

Expected behavior

The exception bubbles up instead of trying to re-create the table.

System information

The output of dbt --version:

Core:
  - installed: 1.3.1
  - latest:    1.3.1 - Up to date!

Plugins:
  - spark: 1.3.0 - Up to date!

The operating system you're using:

MacOS

The output of python --version:

Python 3.9.15

Duplicates records with using incremental with hudi merge and unique_key and partitions

Describe the bug

I found an unexpected behaviour with hudi incremental loads using unique_key and the merge strategy, when partitions are involved.

If the partition of a record changes, the run leads to duplicates: the same key used for uniqueness appears multiple times, once with the old partition value and once with the new one.
The expected behaviour is that the record is replaced and its new partition takes effect.

Steps To Reproduce

Create a model with this content

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
	partition_by=['status'],
    file_format='hudi'
) }}

WITH data AS (
	SELECT 'A' AS user_id, 'active' AS status
	UNION ALL
	SELECT 'B' AS user_id, 'active' AS status
	UNION ALL
	SELECT 'C' AS user_id, 'disabled' AS status
)

SELECT
	user_id,
	status,
	current_timestamp() AS inserted_at
FROM data

Run the above model. Then change the model to this one:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
	partition_by=['status'],
    file_format='hudi'
) }}

WITH data AS (
	SELECT 'C' AS user_id, 'disabled' AS status
)

SELECT
	user_id,
	status,
	current_timestamp() AS inserted_at
FROM data

Run the updated model.

For user_id='C' we expect only one record, with the partition status='disabled', but 2 records are returned.

Expected behaviour

When using the incremental materialisation with unique_key, the model should not produce duplicates. Hence a query like:

select my_unique_id, count(*) as c
from my_model
group by 1
having c > 1

should return an empty result, but this is not the case.
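
The same check can also be expressed as a standard dbt schema test, so every run fails loudly when duplicates appear (a minimal sketch using the names from the query above):

version: 2

models:
  - name: my_model
    columns:
      - name: my_unique_id
        tests:
          - unique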

System information

The output of dbt --version:

Core:
  - installed: 1.2.1
  - latest:    1.3.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - spark: 1.2.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: MacOS BigSur

The output of python --version: Python 3.9.6

Additional context

I believe the issue is here: if the table already exists, we should just overwrite.
I'm not sure whether this behaviour is related to the dbt adapter or to Hudi itself, see this;
a possible solution to consider can also be found here.

Glue Interactive sessions are not being stopped

Describe the bug

I ran a dbt project and the Glue session didn't stop at the end of the run. It eventually timed out thanks to the timeout configured in the profile, but it was never stopped explicitly.

Steps To Reproduce

Run a project until it finishes.
Go to the AWS console -> Glue -> Interactive Sessions: you will see that your session is still Ready, and only later does it change status to Timeout.

Expected behavior

dbt-glue should stop the Glue Interactive session at the end of the run

System information

The output of dbt --version:

Core:
  - installed: 1.2.2
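
As a temporary workaround (not part of the adapter), lingering sessions can be stopped manually through the Glue API; a minimal boto3 sketch, assuming default credentials and region:

import boto3

glue = boto3.client("glue")

# Stop every interactive session that is still in the READY state.
for session in glue.list_sessions()["Sessions"]:
    if session["Status"] == "READY":
        glue.stop_session(Id=session["Id"])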

Java Error when building models

Describe the bug

I would like to know if this issue is known and whether I can get some help.
After deploying the CloudFormation stack, configuring the repo, and running "dbt run --profiles-dir profile", the models start to build and then fail. dbt debug runs fine.

Steps To Reproduce

  1. Build the repo
  2. Deploy the stack with CloudFormation
  3. Export the environment variables
  4. Run dbt run --profiles-dir profile

Expected behavior

Models should be able to build and finish successfully

Screenshots and log output
'''), Py4JError: An error occurred while calling o91.toString. Trace:

18:25:24 java.lang.IllegalArgumentException: object is not an instance of declaring class
18:25:24 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
18:25:24 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
18:25:24 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
18:25:24 at java.lang.reflect.Method.invoke(Method.java:498)
18:25:24 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
18:25:24 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
18:25:24 at py4j.Gateway.invoke(Gateway.java:282)
18:25:24 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
18:25:24 at py4j.commands.CallCommand.execute(CallCommand.java:79)
18:25:24 at py4j.GatewayConnection.run(GatewayConnection.java:238)
18:25:24 at java.lang.Thread.run(Thread.java:750)
18:25:24
18:25:24
18:25:24 compiled Code at target/run/dbtgluenyctaxidemo/models/silver_metrics/silver_nyctaxi_avg_metrics.sql

System information

The output of dbt --version:

Core:
  - installed: 1.3.0
  - latest:    1.3.0 - Up to date!

Plugins:
  - spark: 1.3.0 - Up to date!

The operating system you're using:
Macos

The output of python --version:
Python 3.8.9

Additional context

Just wondering if you guys have seen this error before.

Cross account access for dbt sources

Describe the feature

We are planning to use dbt-glue with models that source tables from other accounts in the organization. It seems dbt-glue doesn't support/accept a catalog ID for cross-account access.

Describe alternatives you've considered

As this feature is not supported, we can't use dbt-glue. Instead we are writing Glue PySpark code ourselves.

Who will this benefit?

Glue jobs that require cross-account access for source data.

File Format JSON - Table fails to register

Describe the bug

When you set the file_format to json and complete a dbt run, the JSON files are written to the correct path on S3; however, the Glue Data Catalog is not updated with the new table.

Steps To Reproduce

Set the model properties as follows:

{{ config(
    partition_by=['year','month','day','hour'],
    file_format='json',
    materialized='table',
    custom_location='s3://' + var('AWS_ACCOUNT_ID') + '-' + var('AWS_REGION') + '-partner-data/egress/posting_event'
) }}

Expected behavior

As when the file_format is parquet or hudi, I would expect the process to complete without error and register my new table in the Glue Data Catalog.

Screenshots and log output

Error returned to the console is:
IllegalArgumentException: Can not create a Path from an empty string

System information

The output of dbt --version:

Core:
  - installed: 1.3.0
  - latest:    1.3.0 - Up to date!

Plugins:
  - redshift: 1.3.0 - Up to date!
  - postgres: 1.3.0 - Up to date!
  - spark:    1.3.0 - Up to date!

The operating system you're using:

The output of python --version:

Python 3.10.8

New columns (on schema change) not supported when working with hudi/incremental

Describe the bug

Adding new columns when working with hudi incremental models leads to issues.

Steps To Reproduce

Create a model like this:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
	partition_by=['status'],
    file_format='hudi'
) }}

WITH data AS (
	SELECT 'A' AS user_id, 'active' AS status, 'pi' AS name
	UNION ALL
	SELECT 'B' AS user_id, 'active' AS status, 'sh' AS name
	UNION ALL
	SELECT 'C' AS user_id, 'active' AS status, 'zh' AS name
)

SELECT
	user_id,
	status,
	name,
	current_timestamp() AS inserted_at
FROM data

run the model. Then change the model to:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
	partition_by=['status'],
    file_format='hudi'
) }}

WITH data AS (
	SELECT 'A' AS user_id, 'active' AS status, 'pi' AS name, 'test' AS col_a
	UNION ALL
	SELECT 'B' AS user_id, 'active' AS status, 'sh' AS name, 'test' AS col_a
	UNION ALL
	SELECT 'C' AS user_id, 'active' AS status, 'zh' AS name, 'test' AS col_a
)

SELECT
	user_id,
	status,
	name,
        col_a,
	current_timestamp() AS inserted_at
FROM data

After running the model I'm getting this error:

Caused by: java.io.IOException: Required column is missing in data file. Col: [col_a]
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.validateRequestedSchemaIsSupported(VectorizedParquetRecordReader.java:331)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildPrefetcherWithPartitionValues$1(ParquetFileFormat.scala:543)
        at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
        at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anon$1.call(AsyncFileDownloader.scala:73)
        at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anon$1.call(AsyncFileDownloader.scala:72)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more

14:49:22  1 of 1 ERROR creating incremental model lakehouse_silver.test_hudi ............. [ERROR in 56.69s]
14:49:22  
14:49:22  Finished running 1 incremental model in 0 hours 1 minutes and 31.32 seconds (91.32s).
14:49:23  
14:49:23  Completed with 1 error and 0 warnings:
14:49:23  
14:49:23  Database Error in model test_hudi (models/silver/test_hudi.sql)
14:49:23    Glue cursor returned `error` for statement None for code SqlWrapper2.execute('''/* {"app": "dbt", "dbt_version": "1.2.1", "profile_name": "glue_lakehouse", "target_name": "dev", "node_id": "model.lakehouse.test_hudi"} */
14:49:23    select * from lakehouse_silver.test_hudi limit 1 '''), Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
14:49:23    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 57.0 failed 4 times, most recent failure: Lost task 1.3 in stage 57.0 (TID 397) (172.36.209.233 executor 2): org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://nicor88-lakehouse/lakehouse_silver/test_hudi/changed/1c2f729b-e21f-4bdd-a62f-76372dcc9fd9-0_2-41-355_20221014144605641.parquet, range: 0-432654, partition values: [changed], isDataPresent: false, eTag: f2e1d7c8388a7d555915c21cdd27c442
14:49:23        at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.next(AsyncFileDownloader.scala:142)
14:49:23        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.getNextFile(FileScanRDD.scala:279)
14:49:23        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:201)
14:49:23        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
14:49:23        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:606)
14:49:23        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
14:49:23        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
14:49:23        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
14:49:23        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
14:49:23        at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:262)
14:49:23        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:454)
14:49:23        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:133)
14:49:23        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
14:49:23        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
14:49:23        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
14:49:23        at org.apache.spark.scheduler.Task.run(Task.scala:131)
14:49:23        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
14:49:23        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
14:49:23        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
14:49:23        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
14:49:23        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
14:49:23        at java.lang.Thread.run(Thread.java:750)
14:49:23    Caused by: java.io.IOException: Required column is missing in data file. Col: [col_a]
14:49:23        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.validateRequestedSchemaIsSupported(VectorizedParquetRecordReader.java:331)
14:49:23        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildPrefetcherWithPartitionValues$1(ParquetFileFormat.scala:543)
14:49:23        at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
14:49:23        at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anon$1.call(AsyncFileDownloader.scala:73)
14:49:23        at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anon$1.call(AsyncFileDownloader.scala:72)
14:49:23        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
14:49:23        ... 3 more
14:49:23    
14:49:23    Driver stacktrace:
14:49:23        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2465)
14:49:23        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2414)
14:49:23        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2413)
14:49:23        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
14:49:23        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
14:49:23        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
14:49:23        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2413)
14:49:23        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)
14:49:23        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)
14:49:23        at scala.Option.foreach(Option.scala:257)
14:49:23        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)
14:49:23        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2679)
14:49:23        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2621)
14:49:23        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2610)
14:49:23        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
14:49:23        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
14:49:23        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
14:49:23        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)
14:49:23        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)
14:49:23        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
14:49:23        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
14:49:23        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
14:49:23        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
14:49:23        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
14:49:23        at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
14:49:23        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
14:49:23        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
14:49:23        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
14:49:23        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
14:49:23        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
14:49:23        at java.lang.reflect.Method.invoke(Method.java:498)
14:49:23        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
14:49:23        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
14:49:23        at py4j.Gateway.invoke(Gateway.java:282)
14:49:23        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
14:49:23        at py4j.commands.CallCommand.execute(CallCommand.java:79)
14:49:23        at py4j.GatewayConnection.run(GatewayConnection.java:238)
14:49:23        at java.lang.Thread.run(Thread.java:750)
14:49:23    Caused by: org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://grover-eu-central-1-dev-data-lakehouse/lakehouse_silver/test_hudi/changed/1c2f729b-e21f-4bdd-a62f-76372dcc9fd9-0_2-41-355_20221014144605641.parquet, range: 0-432654, partition values: [changed], isDataPresent: false, eTag: f2e1d7c8388a7d555915c21cdd27c442
14:49:23        at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.next(AsyncFileDownloader.scala:142)
14:49:23        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.getNextFile(FileScanRDD.scala:279)
14:49:23        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:201)
14:49:23        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
14:49:23        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:606)
14:49:23        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
14:49:23        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
14:49:23        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
14:49:23        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
14:49:23        at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:262)
14:49:23        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:454)
14:49:23        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:133)
14:49:23        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
14:49:23        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
14:49:23        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
14:49:23        at org.apache.spark.scheduler.Task.run(Task.scala:131)
14:49:23        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
14:49:23        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
14:49:23        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
14:49:23        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
14:49:23        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
14:49:23        ... 1 more
14:49:23    Caused by: java.io.IOException: Required column is missing in data file. Col: [col_a]
14:49:23        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.validateRequestedSchemaIsSupported(VectorizedParquetRecordReader.java:331)
14:49:23        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildPrefetcherWithPartitionValues$1(ParquetFileFormat.scala:543)
14:49:23        at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
14:49:23        at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anon$1.call(AsyncFileDownloader.scala:73)
14:49:23        at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anon$1.call(AsyncFileDownloader.scala:72)
14:49:23        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
14:49:23        ... 3 more
14:49:23    
14:49:23    compiled SQL at target/run/lakehouse/models/silver/test_hudi.sql
14:49:23  
14:49:23  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

As a workaround, if I drop the table from Glue and re-run the model, it works. But imagine a scenario where analysts add a new field: this easily leads to breaking behavior.

Expected behavior

The new column col_a, added in the second run, should be added to the table.

System information

The output of dbt --version:

Core:
  - installed: 1.2.1
  - latest:    1.3.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - spark: 1.2.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: MacOS BigSur

The output of python --version: 3.9.6

Additional context

This could be a limitation of hudi itself, if it doesn't support schema changes.
That seems not to be the case, though, as schema changes appear to be supported. See here
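
For reference, dbt-core exposes an on_schema_change setting on incremental models (ignore, fail, append_new_columns, sync_all_columns) that targets exactly this case; whether dbt-glue honours it for hudi tables is unclear to me, so the config below is only a sketch of what to try:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
    partition_by=['status'],
    file_format='hudi',
    on_schema_change='append_new_columns'
) }}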

Unable to make the incremental model work with insert_overwrite strategy

Describe the bug

I'm having trouble executing models with materialized='incremental' in dbt with Glue Spark. The incremental condition from the is_incremental() block is not being added to the SQL query.

Profile:

aws_demo_proj:
  target: aws_dev
  outputs:
    aws_dev:
      type: glue
      query-comment: This is a glue dbt example
      role_arn: arn:aws:iam::xxxxxxxx:role/dbt-demo-role
      region: eu-central-1
      workers: 5
      worker_type: G.2X
      schema: "det"
      database: "det_landing_data_platform"
      session_provisioning_timeout_in_seconds: 120
      location: "s3a://det-landing-data-lake"
      idle_timeout: 120

Incremental Model Config and SQL:

{{ config(
    materialized='incremental',
    custom_location='s3a://det-published-data-lake/booking-solutions/fulfilment_view_data',
    incremental_strategy='insert_overwrite',
    schema='published_booking_solutions',
    file_format='parquet',
    partition_by=['business_date']

) }}

SELECT new_process_id, booking_source,
((booked_restored_gross_price_hotel - new_restored_gross_price_hotel) * new_restored_gross_price_customer / new_restored_gross_price_hotel) as savings,
(booked_restored_gross_price_hotel * new_restored_gross_price_customer / new_restored_gross_price_hotel) as booked_restored_gross_price_customer,
new_restored_gross_price_customer,
booked_price_rate_type,
init_booking_multisourced,
rebooking_multisourced,
initial_crs_type,
DATE(concat_ws('-', year, month, day)) as business_date
FROM prod_landing_booking_solutions.fulfilment_view
WHERE
    -- REQUIRED TO OMIT DIVISION BY ZERO value of new_restored_gross_price_hotel
    new_restored_gross_price_hotel > 0
    AND new_process_id IS NOT NULL
    AND booked_restored_gross_price_hotel > new_restored_gross_price_hotel
    {% if is_incremental() %}
    and  DATE(concat_ws('-', year,month, day)) >= (select max(business_date) from  {{this}} )
    {% endif %}

Steps To Reproduce

Rerun the dbt model and check the query generated.

Expected behavior

Incremental conditions need to be added to the SQL query on runs after the initial table creation.

Screenshots and log output

incremental_first_load.log
incremental_second_load.log

System information

The output of dbt --version:

Core:
  - installed: 1.1.1
  - latest:    1.1.1 - Up to date!

Plugins:
  - spark: 1.1.0 - Up to date!
  - glue: 0.2.0 - Up to date!

The operating system you're using: Mac Os

The output of python --version:

Python 3.8.8

Error when running models at second time (materialized='table')

Describe the bug

I tried to run the models following this tutorial. It succeeds the first time, but fails when I run it again.

Steps To Reproduce

  1. Following this tutorial
  2. Run the command dbt run --profiles-dir profile in terminal
  3. Run the command again --> Failed.

Expected behavior

The job should be successful and the tables will be re-created.

Screenshots and log output

Error message as below:

(dbt_venv) [cloudshell-user@ip-10-0-73-218 dbtgluenyctaxidemo]$ dbt run --profiles-dir profile
16:00:47  Running with dbt=1.1.1
16:00:47  Found 4 models, 0 tests, 0 snapshots, 0 analyses, 225 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
16:00:47  
16:02:19  Concurrency: 1 threads (target='dev')
16:02:19  
16:02:19  1 of 4 START table model dbt_nyc_metrics.silver_nyctaxi_avg_metrics ............ [RUN]
16:02:22  Unhandled error while executing model.dbtgluenyctaxidemo.silver_nyctaxi_avg_metrics
Field "type" of type Optional[RelationType] in SparkRelation has invalid value Undefined
16:02:22  1 of 4 ERROR creating table model dbt_nyc_metrics.silver_nyctaxi_avg_metrics ... [ERROR in 2.09s]
16:02:22  2 of 4 SKIP relation dbt_nyc_metrics.gold_nyctaxi_cost_metrics ................. [SKIP]
16:02:22  3 of 4 SKIP relation dbt_nyc_metrics.gold_nyctaxi_distance_metrics ............. [SKIP]
16:02:22  4 of 4 SKIP relation dbt_nyc_metrics.gold_nyctaxi_passengers_metrics ........... [SKIP]
16:02:24  
16:02:24  Finished running 4 table models in 96.53s.
16:02:24  
16:02:24  Completed with 1 error and 0 warnings:
16:02:24  
16:02:24  Field "type" of type Optional[RelationType] in SparkRelation has invalid value Undefined
16:02:24  
16:02:24  Done. PASS=0 WARN=0 ERROR=1 SKIP=3 TOTAL=4

System information

The output of dbt --version:

Core:
  - installed: 1.1.1
  - latest:    1.1.1 - Up to date!

Plugins:
  - spark: 1.1.0 - Up to date!

The operating system you're using: AWS CloudShell
The output of python --version: Python 3.7.10

Error when creating snapshots

Describe the bug

When creating snapshots, I get an error that the target table the snapshots should be loaded into does not exist. This causes the dbt snapshot execution to fail.

Steps To Reproduce

  1. Create a snapshots directory in the dbt project folder
  2. add a file demosnapshot.sql with the content as seen below:
{% snapshot demosnapshot %}
    {{
        config(
            strategy='timestamp',
            target_schema='mytargetdatabase',
            target_database='mytargetdatabase',
            unique_key='unique_id',
            updated_at='created_date',
            file_format='hudi'
        )
    }}
    select * from {{ source('data_source', 'events_data') }}
{% endsnapshot %}
  3. Add an IAM Role with Lake Formation permissions used by Glue Interactive Sessions
  DataLakeDPCleanedTablesPermissions:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier: !GetAtt IamGlueRole.Arn
      Resource:
        TableResource:
          DatabaseName: !Sub mytargetdatabase
          TableWildcard: { }
      Permissions:
        - DESCRIBE
        - SELECT
        - INSERT
        - DROP

  DataLakeDPCleanedDatabasePermissions:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier: !GetAtt IamGlueRole.Arn
      Resource:
        DatabaseResource:
          Name: !Sub mytargetdatabase
      Permissions:
        - CREATE_TABLE
        - ALTER
        - DROP
        - DESCRIBE
  4. Run the command $ dbt snapshot in the terminal

Expected behavior

According to the dbt example on snapshots, no error is expected

Screenshots and log output

dbt.log output below

06:30:57.855066 [debug] [Thread-1  ]: Glue adapter: GlueConnection cursor called
06:30:58.151902 [error] [Thread-1  ]: Glue adapter: An error occurred (AccessDeniedException) when calling the GetTable operation: Insufficient Lake Formation permission(s) on demosnapshot
06:30:58.152336 [error] [Thread-1  ]: Glue adapter: relation mytargetdatabase.demosnapshot not found
06:30:58.152817 [error] [Thread-1  ]: Glue adapter: name 'list_schemas' is not defined
06:30:58.153291 [error] [Thread-1  ]: Glue adapter: check_schema_exists exception
06:30:58.156281 [debug] [Thread-1  ]: finished collecting timing info
06:30:58.156535 [debug] [Thread-1  ]: On snapshot.datalakedbtdemo.demosnapshot: Close
06:30:58.156717 [debug] [Thread-1  ]: Glue adapter: NotImplemented: close

System information

The output of dbt --version:

Core:
  - installed: 1.1.0
  - latest:    1.1.0 - Up to date!

Plugins:
  - bigquery:  1.1.0 - Up to date!
  - snowflake: 1.1.0 - Up to date!
  - redshift:  1.1.0 - Up to date!
  - postgres:  1.1.0 - Up to date!

The operating system you're using: MacOS

The output of python --version: python3

Additional context

To be clear, the snapshot table does not exist yet, and the Glue Interactive Sessions IAM role has been granted permissions to create tables.

Glue Session is Terminating before the Catalog/Manifest file is generated

Describe the bug

The dbt docs generate command is failing because the Glue session is terminated after the model checks are completed, before the catalog/manifest files are built.

Steps To Reproduce

  1. DBT Project available
  2. Execute dbt --debug --log-format text docs generate --profiles-dir profiles --project-dir .

Expected behavior

Glue Session should terminate only after the manifest/catalog files are generated

Screenshots and log output

2:11:28.481422 [debug] [Thread-1  ]: finished collecting timing info
12:11:28.481993 [debug] [Thread-1  ]: Began executing node model.hrs_data_lake.rebooking_previous_year_cost_center
12:11:28.482540 [debug] [Thread-1  ]: finished collecting timing info
12:11:28.483081 [debug] [Thread-1  ]: On model.hrs_data_lake.rebooking_previous_year_cost_center: Close
12:11:28.483605 [debug] [Thread-1  ]: Glue adapter: NotImplemented: close
12:11:28.484617 [debug] [Thread-1  ]: Finished running node model.hrs_data_lake.rebooking_previous_year_cost_center
12:11:28.486135 [debug] [MainThread]: Glue adapter: cleanup called
12:11:28.638774 [info ] [MainThread]: Done.
12:11:28.640883 [debug] [MainThread]: Acquiring new glue connection "generate_catalog"
12:11:28.641580 [info ] [MainThread]: Building catalog
12:11:28.653540 [debug] [ThreadPool]: Acquiring new glue connection "det_cleaned_coe_data_and_ai"
12:11:28.654383 [debug] [ThreadPool]: Opening a new connection, currently in state closed
12:11:28.655957 [debug] [ThreadPool]: Glue adapter: GlueConnection connect called
12:11:28.713079 [debug] [ThreadPool]: Glue adapter: Existing session with status : STOPPING
12:11:28.742260 [debug] [ThreadPool]: Glue adapter: GlueConnection _init_session called
12:11:28.742949 [debug] [ThreadPool]: Glue adapter: GlueConnection session_id : det-dbt-role-DataPlatform-dbt-glue-9aef80ee-db70-47be-af07-a8b625ec42fa
12:11:28.782975 [error] [ThreadPool]: Glue adapter: Error in GlueCursor execute An error occurred (InvalidInputException) when calling the RunStatement operation: Session is not ready, session_id=det-dbt-role-DataPlatform-dbt-glue-9aef80ee-db70-47be-af07-a8b625ec42fa, status=CANCELLING
12:11:28.783619 [error] [ThreadPool]: Glue adapter: Got an error when attempting to open a GlueSession : __init__() missing 2 required positional arguments: 'cmd' and 'message'
12:11:28.784474 [debug] [ThreadPool]: On det_cleaned_coe_data_and_ai: No close available on handle
12:11:28.786027 [debug] [ThreadPool]: Acquiring new glue connection "det_landing_booking_solutions"
12:11:28.787078 [warn ] [MainThread]: Encountered an error while generating catalog: Database Error
  Got an error when attempting to open a GlueSessions: __init__() missing 2 required positional arguments: 'cmd' and 'message'
12:11:28.787507 [debug] [ThreadPool]: Opening a new connection, currently in state closed
12:11:28.788219 [debug] [ThreadPool]: Glue adapter: GlueConnection connect called
12:11:28.834844 [debug] [ThreadPool]: Glue adapter: Existing session with status : STOPPING
12:11:28.862500 [debug] [ThreadPool]: Glue adapter: GlueConnection _init_session called
12:11:28.863223 [debug] [ThreadPool]: Glue adapter: GlueConnection session_id : det-dbt-role-DataPlatform-dbt-glue-9aef80ee-db70-47be-af07-a8b625ec42fa
12:11:28.927459 [error] [ThreadPool]: Glue adapter: Error in GlueCursor execute An error occurred (InvalidInputException) when calling the RunStatement operation: Session is not ready, session_id=det-dbt-role-DataPlatform-dbt-glue-9aef80ee-db70-47be-af07-a8b625ec42fa, status=CANCELLING
12:11:28.928106 [error] [ThreadPool]: Glue adapter: Got an error when attempting to open a GlueSession : __init__() missing 2 required positional arguments: 'cmd' and 'message'
12:11:28.928956 [debug] [ThreadPool]: On det_landing_booking_solutions: No close available on handle
12:11:28.930498 [debug] [ThreadPool]: Acquiring new glue connection "det_landing_itelya"
12:11:28.931252 [warn ] [MainThread]: Encountered an error while generating catalog: Database Error
  Got an error when attempting to open a GlueSessions: __init__() missing 2 required positional arguments: 'cmd' and 'message'
12:11:28.931683 [debug] [ThreadPool]: Opening a new connection, currently in state closed
12:11:28.932415 [debug] [ThreadPool]: Glue adapter: GlueConnection connect called
12:11:28.987260 [debug] [ThreadPool]: Glue adapter: Existing session with status : STOPPING
12:11:29.016899 [debug] [ThreadPool]: Glue adapter: GlueConnection _init_session called
12:11:29.017451 [debug] [ThreadPool]: Glue adapter: GlueConnection session_id : det-dbt-role-DataPlatform-dbt-glue-9aef80ee-db70-47be-af07-a8b625ec42fa
12:11:29.066314 [error] [ThreadPool]: Glue adapter: Error in GlueCursor execute An error occurred (InvalidInputException) when calling the RunStatement operation: Session is not ready, session_id=det-dbt-role-DataPlatform-dbt-glue-9aef80ee-db70-47be-af07-a8b625ec42fa, status=CANCELLING
12:11:29.066930 [error] [ThreadPool]: Glue adapter: Got an error when attempting to open a GlueSession : __init__() missing 2 required positional arguments: 'cmd' and 'message'
12:11:29.067732 [debug] [ThreadPool]: On det_landing_itelya: No close available on handle
12:11:29.069254 [debug] [ThreadPool]: Acquiring new glue connection "prod_data_platform"
12:11:29.070209 [warn ] [MainThread]: Encountered an error while generating catalog: Database Error
  Got an error when attempting to open a GlueSessions: __init__() missing 2 required positional arguments: 'cmd' and 'message'
12:11:29.070585 [debug] [ThreadPool]: Opening a new connection, currently in state closed
12:11:29.071291 [debug] [ThreadPool]: Glue adapter: GlueConnection connect called
12:11:29.110890 [debug] [ThreadPool]: Glue adapter: Existing session with status : STOPPING
12:11:29.142979 [debug] [ThreadPool]: Glue adapter: GlueConnection _init_session called
12:11:29.143556 [debug] [ThreadPool]: Glue adapter: GlueConnection session_id : det-dbt-role-DataPlatform-dbt-glue-9aef80ee-db70-47be-af07-a8b625ec42fa
12:11:29.198718 [error] [ThreadPool]: Glue adapter: Error in GlueCursor execute An error occurred (InvalidInputException) when calling the RunStatement operation: Session is not ready, session_id=det-dbt-role-DataPlatform-dbt-glue-9aef80ee-db70-47be-af07-a8b625ec42fa, status=CANCELLING
12:11:29.199293 [error] [ThreadPool]: Glue adapter: Got an error when attempting to open a GlueSession : __init__() missing 2 required positional arguments: 'cmd' and 'message'
12:11:29.200115 [debug] [ThreadPool]: On prod_data_platform: No close available on handle
12:11:29.201478 [debug] [ThreadPool]: Acquiring new glue connection "prod_landing_crm"
12:11:29.202520 [debug] [ThreadPool]: Opening a new connection, currently in state closed
12:11:29.203144 [warn ] [MainThread]: Encountered an error while generating catalog: Database Error
  Got an error when attempting to open a GlueSessions: __init__() missing 2 required positional arguments: 'cmd' and 'message'
12:11:29.203510 [debug] [ThreadPool]: Glue adapter: GlueConnection connect called
12:11:29.243523 [debug] [ThreadPool]: Glue adapter: Existing session with status : STOPPING
12:11:29.272915 [debug] [ThreadPool]: Glue adapter: GlueConnection _init_session called
12:11:29.273490 [debug] [ThreadPool]: Glue adapter: GlueConnection session_id : det-dbt-role-DataPlatform-dbt-glue-9aef80ee-db70-47be-af07-a8b625ec42fa
12:11:29.332791 [error] [ThreadPool]: Glue adapter: Error in GlueCursor execute An error occurred (InvalidInputException) when calling the RunStatement operation: Session is not ready, session_id=det-dbt-role-DataPlatform-dbt-glue-9aef80ee-db70-47be-af07-a8b625ec42fa, status=CANCELLING
12:11:29.333370 [error] [ThreadPool]: Glue adapter: Got an error when attempting to open a GlueSession : __init__() missing 2 required positional arguments: 'cmd' and 'message'
12:11:29.334149 [debug] [ThreadPool]: On prod_landing_crm: No close available on handle
12:11:29.335231 [warn ] [MainThread]: Encountered an error while generating catalog: Database Error
  Got an error when attempting to open a GlueSessions: __init__() missing 2 required positional arguments: 'cmd' and 'message'
12:11:29.353635 [error] [MainThread]: dbt encountered 5 failures while writing the catalog
12:11:29.354366 [info ] [MainThread]: Catalog written to /tmp/dbt/target/catalog.json
12:11:29.358463 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0xffff94f7d580>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0xffff94d9e1c0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0xffff95006670>]}
12:11:29.359382 [debug] [MainThread]: Flushing usage events
12:11:29.757419 [debug] [MainThread]: Glue adapter: cleanup called
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=9, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.17.0.3', 35068), raddr=('3.67.147.184', 443)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.17.0.3', 35734), raddr=('3.67.147.184', 443)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=11, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.17.0.3', 54878), raddr=('18.198.196.34', 443)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=10, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.17.0.3', 46734), raddr=('52.29.165.100', 443)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=13, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.17.0.3', 35108), raddr=('3.67.147.184', 443)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=14, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.17.0.3', 46754), raddr=('52.29.165.100', 443)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=15, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.17.0.3', 54588), raddr=('18.159.90.183', 443)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=16, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.17.0.3', 33282), raddr=('18.159.90.183', 443)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=12, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.17.0.3', 54892), raddr=('18.198.196.34', 443)>

System information

The output of dbt --version:

Core:
  - installed: 1.2.1
  - latest:    1.3.0 - Update available!
  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation
Plugins:
  - spark: 1.2.0 - Update available!
  - dbt-glue: 0.2.6

The operating system you're using:
Ubuntu 12

The output of python --version:
Python 3.8.15

Additional context

Profile Info:

hrs-data-lake:
  target: hrs-data-lake
  outputs:
    hrs-data-lake:
      type: glue
      query-comment: This is a glue dbt example
      role_arn: arn:aws:iam::xxxxxx:role/det-dbt-role-DataPlatform
      region: eu-central-1
      workers: 5
      worker_type: G.2X
      idle_timeout: 600
      schema: "det"
      database: "det_landing_data_platform"
      session_provisioning_timeout_in_seconds: 600
      location: "s3a://det-landing-data-lake/data-platform"
      glue_version: "2.0"

Provide options to add hudi configuration to dbt models Ex. Table type, precombinefield

Describe the feature

The feature request is to have a generic way to add Hudi configurations to data models. At present, almost all options are hardcoded in the code except the primary key and partition key. Something similar to the below:
hudi_options={
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.precombine.field': 'eventtime',
    'hoodie.datasource.write.operation': 'upsert'
}

Describe alternatives you've considered

None were considered.

Additional context

I investigated this when I found that the precombine key was not taking effect. It is hardcoded in the code as below:
'hoodie.datasource.write.precombine.field': 'update_hudi_ts'
My suggestion is to allow overriding these options with a separate configuration block for Hudi.
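
A hypothetical model-level configuration along those lines (this key does not exist in the adapter today; it only illustrates the requested feature, reusing the options above, and the unique_key is made up):

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id',
    file_format='hudi',
    hudi_options={
        'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
        'hoodie.datasource.write.precombine.field': 'eventtime',
        'hoodie.datasource.write.operation': 'upsert'
    }
) }}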

Who will this benefit?

Any use case where a data lake is being built using Hudi.

Are you interested in contributing this feature?

Yes. I have already tested this on my local setup.

Add iceberg as a file_format for snapshot

Describe the feature

Currently only the hudi and delta file formats are accepted for snapshots. As iceberg is now part of the accepted file formats for the adapter, I'd like it to be considered for snapshots too.

Describe alternatives you've considered

I tried using time travel queries on Athena based on the timestamp provided in the Iceberg metadata, but this is not ideal and I can only use version travel for now.

Who will this benefit?

My team is migrating hundreds of models from the BigQuery adapter to AWS Glue.

Are you interested in contributing this feature?

I'd be willing to write some code if needed, with probably some guidance from the Slack channel

Support of partition with Apache HUDI

Describe the feature

Support of partition with Apache HUDI.

Who will this benefit?

Users that would like to use partitions with Apache HUDI

Are you interested in contributing this feature?

Yes

upgrade to support dbt-core v1.3.0

Background

The latest release cut for 1.3.0, dbt-core==1.3.0rc2 was published on October 3, 2022 (PyPI | Github). We are targeting releasing the official cut of 1.3.0 in time for the week of October 16 (in time for Coalesce conference).

We're trying to establish the following precedent w.r.t. minor versions:
Partner adapter maintainers release their adapter's minor version within four weeks of the initial RC being released. Given the delay on our side in notifying you, we'd like to set a target date of November 7 (four weeks from today) for maintainers to release their minor version.

| Timeframe | Date (intended) | Date (actual) | Event |
| --- | --- | --- | --- |
| D - 3 weeks | Sep 21 | Oct 10 | dbt Labs informs maintainers of upcoming minor release |
| D - 2 weeks | Sep 28 | Sep 28 | core 1.3 RC is released |
| Day D | October 12 | Oct 12 | core 1.3 official is published |
| D + 2 weeks | October 26 | Nov 7 | dbt-adapter 1.3 is published |

How to upgrade

dbt-labs/dbt-core#6011 is an open discussion with more detailed information, and dbt-labs/dbt-core#6040 is for keeping track of the community's progress on releasing 1.3.0

Below is a checklist of work that would enable a successful 1.3.0 release of your adapter.

  • Python Models (if applicable)
  • Incremental Materialization: cleanup and standardization
  • More functional adapter tests to inherit

Support for most recent software versions of Iceberg Connectors from AWS Marketplace

Describe the feature

Support more recent/any version of the Iceberg Connector from AWS Marketplace.

Additional context

As a result of the #132 merge, the Iceberg incremental materialization has been re-designed. This implies that Iceberg Connectors from AWS Marketplace more recent than the Software version 0.12.0-2 (Feb 14, 2022) are no longer supported (they were in the previous Iceberg 'Table' materialization).

Are you interested in contributing this feature?

Yes

Add how to enable auto scaling on documentation

Describe the feature

With the changes introduced to add monitoring to the Glue interactive sessions, we can also enable auto scaling. Adding documentation would help users take advantage of this nice feature.
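
A sketch of how this could be documented, assuming Glue 3.0 or later and a default_arguments profile option like the one proposed in the monitoring feature request further below (both the option name and its syntax are assumptions here, as are the placeholder values):

my_project:
  target: dev
  outputs:
    dev:
      type: glue
      role_arn: arn:aws:iam::xxxxxxxxxxxx:role/dbt-glue-role
      region: eu-central-1
      glue_version: "3.0"
      workers: 10
      worker_type: G.1X
      schema: "my_schema"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://my-dbt-output-bucket"
      default_arguments: "--enable-auto-scaling=true"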

upgrade to support dbt-core v1.2.0

We've just published the release cut of dbt-core 1.2.0, dbt-core 1.2.0rc1 (PyPI | GitHub release notes).

dbt-labs/dbt-core#5468 is an open discussion with more detailed information, and dbt-labs/dbt-core#5474 is for keeping track of the community's progress on releasing 1.2.0

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

  • migrate necessary cross-db macros into adapter and ensure they're tested accordingly
  • remove any copy-and-pasted materialization (if your adapter inherits from another adapter)
  • add new basic tests BaseDocsGenerate and BaseDocsGenReferences
  • consider checking and testing support for Python 3.10

dbt-labs/dbt-core#5432 might make it into the second release cut in the next week, in which case you might also want to:

  • implement method and tests for connection retry logic

Error running dbt run --profiles-dir profile

I am trying to follow this article
https://aws.amazon.com/blogs/big-data/build-your-data-pipeline-in-your-aws-modern-data-platform-using-aws-lake-formation-aws-glue-and-dbt-core/

I ran dbt debug --profiles-dir profile successfully.

However, I ran into an error with the following command:
dbt run --profiles-dir profile

The error is as follows (can someone point me to the fix, please?):

Concurrency: 1 threads (target='dev')
14:43:18  
14:43:18  1 of 4 START table model dbt_nyc_metrics.silver_nyctaxi_avg_metrics ............ [RUN]
14:43:22  Glue adapter: Glue returned `error` for statement None for code SqlWrapper2.execute('''create table dbt_nyc_metrics.silver_nyctaxi_avg_metrics__dbt_tmp
   
   
   
   LOCATION 's3://aws-dbt-glue-datalake-xxxxxxxxxxxxxx-us-east-1/dbt_nyc_metrics/silver_nyctaxi_avg_metrics__dbt_tmp/'
   
   as
     WITH source_avg as ( 
   SELECT avg((CAST(dropoff_datetime as LONG) - CAST(pickup_datetime as LONG))/60) as avg_duration 
   , avg(passenger_count) as avg_passenger_count 
   , avg(trip_distance) as avg_trip_distance 
   , avg(total_amount) as avg_total_amount
   , year
   , month 
   , type
   FROM nyctaxi.records 
   WHERE year = "2016"
   AND dropoff_datetime is not null 
   GROUP BY 5, 6, 7
) 
SELECT *
FROM source_avg'''), AnalysisException: Table or view not found: nyctaxi.records; line 16 pos 9;
'CreateTable `dbt_nyc_metrics`.`silver_nyctaxi_avg_metrics__dbt_tmp`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists
+- 'Project [*]
  +- 'SubqueryAlias source_avg
     +- 'Aggregate [unresolvedordinal(5), unresolvedordinal(6), unresolvedordinal(7)], ['avg(((cast('dropoff_datetime as bigint) - cast('pickup_datetime as bigint)) / 60)) AS avg_duration#0, 'avg('passenger_count) AS avg_passenger_count#1, 'avg('trip_distance) AS avg_trip_distance#2, 'avg('total_amount) AS avg_total_amount#3, 'year, 'month, 'type]
        +- 'Filter (('year = 2016) AND isnotnull('dropoff_datetime))
           +- 'UnresolvedRelation [nyctaxi, records], [], false

14:43:22  1 of 4 ERROR creating table model dbt_nyc_metrics.silver_nyctaxi_avg_metrics ... [ERROR in 3.74s]
14:43:22  2 of 4 SKIP relation dbt_nyc_metrics.gold_nyctaxi_cost_metrics ................. [SKIP]
14:43:22  3 of 4 SKIP relation dbt_nyc_metrics.gold_nyctaxi_distance_metrics ............. [SKIP]
14:43:22  4 of 4 SKIP relation dbt_nyc_metrics.gold_nyctaxi_passengers_metrics ........... [SKIP]
14:43:23  
14:43:23  Finished running 4 table models in 66.80s.
14:43:24  
14:43:24  Completed with 1 error and 0 warnings:
14:43:24  
14:43:24  Database Error in model silver_nyctaxi_avg_metrics (models/silver_metrics/silver_nyctaxi_avg_metrics.sql)
14:43:24    Glue cursor returned `error` for statement None for code SqlWrapper2.execute('''create table dbt_nyc_metrics.silver_nyctaxi_avg_metrics__dbt_tmp
14:43:24        
14:43:24        
14:43:24        
14:43:24        LOCATION 's3://aws-dbt-glue-datalake-xxxxxxxxxxxx-us-east-1/dbt_nyc_metrics/silver_nyctaxi_avg_metrics__dbt_tmp/'
14:43:24        
14:43:24        as
14:43:24          WITH source_avg as ( 
14:43:24        SELECT avg((CAST(dropoff_datetime as LONG) - CAST(pickup_datetime as LONG))/60) as avg_duration 
14:43:24        , avg(passenger_count) as avg_passenger_count 
14:43:24        , avg(trip_distance) as avg_trip_distance 
14:43:24        , avg(total_amount) as avg_total_amount
14:43:24        , year
14:43:24        , month 
14:43:24        , type
14:43:24        FROM nyctaxi.records 
14:43:24        WHERE year = "2016"
14:43:24        AND dropoff_datetime is not null 
14:43:24        GROUP BY 5, 6, 7
14:43:24    ) 
14:43:24    SELECT *
14:43:24    FROM source_avg'''), AnalysisException: Table or view not found: nyctaxi.records; line 16 pos 9;
14:43:24    'CreateTable `dbt_nyc_metrics`.`silver_nyctaxi_avg_metrics__dbt_tmp`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists
14:43:24    +- 'Project [*]
14:43:24       +- 'SubqueryAlias source_avg
14:43:24          +- 'Aggregate [unresolvedordinal(5), unresolvedordinal(6), unresolvedordinal(7)], ['avg(((cast('dropoff_datetime as bigint) - cast('pickup_datetime as bigint)) / 60)) AS avg_duration#0, 'avg('passenger_count) AS avg_passenger_count#1, 'avg('trip_distance) AS avg_trip_distance#2, 'avg('total_amount) AS avg_total_amount#3, 'year, 'month, 'type]
14:43:24             +- 'Filter (('year = 2016) AND isnotnull('dropoff_datetime))
14:43:24                +- 'UnresolvedRelation [nyctaxi, records], [], false
14:43:24    
14:43:24    compiled SQL at target/run/dbtgluenyctaxidemo/models/silver_metrics/silver_nyctaxi_avg_metrics.sql
14:43:24  
14:43:24  Done. PASS=0 WARN=0 ERROR=1 SKIP=3 TOTAL=4

Can't use existing tables as source

Describe the bug

I am getting a java.lang.IllegalArgumentException: Can not create a Path from an empty string error when I try to use an existing Athena view as a source for my model.

Similarly, when I create a model as a view using dbt, it doesn't appear in the Athena UI, but it does exist in the Glue metadata and can even be queried via Athena.

Do you have any idea what's wrong with the views?

System information

The output of dbt --version:

Core:
  - installed: 1.2.1
  - latest:    1.2.1 - Up to date!

Plugins:
  - spark: 1.2.0 - Up to date!

The operating system you're using:
macOS Monterey 12.6 (21G115) Apple M1

The output of python --version:

Python 3.9.6

Add table properties to the tables created by DBT on the AWS Glue Catalog

Describe the feature

Currently, AWS Glue allows creating tables with properties. These properties can be useful to record information like:

  • Who created this table?
  • Which department owns it?
  • Does it contain any PII data?

Describe alternatives you've considered

Properties could be added in the dbt_project.yml file or as Jinja config in each model.
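
A hypothetical model-level version of this (the table_properties key does not exist yet; names and values are illustrative only):

{{ config(
    materialized='table',
    file_format='parquet',
    table_properties={
        'created_by': 'dbt',
        'owner': 'data-engineering',
        'contains_pii': 'false'
    }
) }}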

Who will this benefit?

Customers that need more metadata to filter their tables.

Are you interested in contributing this feature?

sure!

Support for Incremental/merge using Iceberg

Describe the feature

Support for merge when using Iceberg table format.

Describe alternatives you've considered

One of the most powerful features of dbt is incremental models. In combination with Iceberg, which is now fully supported in Athena, this would make the dbt-glue adapter really powerful, making it possible to build a full lakehouse setup in AWS using only serverless features.

Additional context

Having support for merge makes it possible to do transactional updates (deletes/updates), and also allows the use of unique_key in dbt, making it possible to build cleaner silver/gold layers.
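
A sketch of the kind of model this would enable; the column names and source are illustrative, and file_format='iceberg' combined with the merge strategy is exactly what is being requested rather than something the adapter is confirmed to support:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='iceberg',
    unique_key='order_id'
) }}

select order_id, customer_id, order_status, updated_at
from {{ source('raw', 'orders') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}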

Who will this benefit?

Using the table materialization can be expensive. Most analytics use cases can be rewritten to be incremental, which reduces cost and improves time to data.

Are you interested in contributing this feature?

Yes absolutely.

Add monitoring functionality to dbt-glue

Describe the feature

AWS Glue provides the Spark UI, plus CloudWatch logs and metrics, for monitoring Glue jobs and interactive sessions.
Currently there is no way to pass the corresponding configuration parameters to the session.

Describe alternatives you've considered

Create a default_arguments parameter on the profile, for example:
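
A sketch of what that could look like in profiles.yml, assuming the proposed default_arguments key; the role ARN, bucket, and argument values are placeholders (the arguments themselves are standard AWS Glue job arguments):

dbt_glue_project:
  target: dev
  outputs:
    dev:
      type: glue
      role_arn: arn:aws:iam::123456789012:role/GlueInteractiveSessionRole
      region: eu-west-1
      workers: 2
      worker_type: G.1X
      schema: analytics
      default_arguments: "--enable-metrics=true, --enable-continuous-cloudwatch-log=true, --enable-spark-ui=true, --spark-event-logs-path=s3://my-bucket/spark-ui-logs/"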

Who will this benefit?

Monitoring is an important part of maintaining the reliability, availability, and performance of AWS Glue and your other AWS solutions.

Are you interested in contributing this feature?

Yes

Add bucketing for file_format parquet

Describe the feature

Add the possibility to specify bucketing with file_format=parquet to improve read performance.
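
A sketch of what the configuration could look like, borrowing the clustered_by and buckets option names from dbt-spark; these names, the source, and the columns are assumptions rather than confirmed dbt-glue options:

{{ config(
    materialized='table',
    file_format='parquet',
    clustered_by=['user_id'],
    buckets=16
) }}

select user_id, event_type, event_ts
from {{ source('events', 'raw_events') }}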

Describe alternatives you've considered

Nothing comes to mind; bucketing is different from partitioning.

Additional context

Bucketing can speed up queries on high-cardinality fields; it's a must-have.

Who will this benefit?

All the use cases where specific WHERE conditions are used on the bucketed columns (e.g. id/user_id); in general, bucketing is recommended for high-cardinality columns.

Are you interested in contributing this feature?

Yup.

Merge - second run creates duplicates

Describe the bug

Starting with a new model where the incremental strategy is set to merge, if the same data is loaded again then duplicate rows are created.

Steps To Reproduce

I have been playing with the settings and think we have found a bug in the merge logic
(assuming we are using it correctly), along with a small workaround.

Here is a sample repo with a simple model and a csv of test data - https://github.com/pixie79/dbt_glue_merge/blob/main/models/refined/customer_test.sql

What we have found is that the first run correctly creates the new table with the appropriate data. If we then rerun the model (either as a full refresh or as an incremental run that also picks up some of the previous data, which we do to ensure at-least-once processing), instead of merging the duplicated records it inserts new duplicate rows. If you run a third time you might expect a third duplicate, but oddly you don't get one.
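
For reference, a minimal model of the kind described; the column names, source, and file_format are illustrative assumptions, and the linked repository contains the actual model:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='hudi',
    unique_key='customer_id'
) }}

select customer_id, customer_name, updated_at
from {{ source('raw', 'customer_test') }}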

Workaround (or potential fix option)

Now for the workaround: although with dbt you cannot create an empty table, it seems that if we update the model to have "limit 1" as the last line and run it, it creates the table with one record as expected. If we then remove the limit 1 and rerun with either an incremental run or a full refresh, it correctly loads all the data. At this point we expected the initial row to be duplicated, but it was not (we have no idea why). Further runs also leave the data intact.
Just to point out: so far we have only replayed exactly the same data, not changed data, so there really should be no duplicates.

Expected behavior

On a second run of the same data, the merge should either re-update the original record or leave it untouched as a duplicate, so that processing is effectively exactly-once.

System information

The output of dbt --version:

poetry run dbt --version                       
The currently activated Python version 3.9.15 is not supported by the project (>=3.10,<3.11).
Trying to find and use a compatible version. 
Using python3 (3.10.8)
Core:
  - installed: 1.3.1
  - latest:    1.3.1 - Up to date!

Plugins:
  - redshift: 1.3.0 - Up to date!
  - postgres: 1.3.1 - Up to date!
  - spark:    1.3.0 - Up to date!

Glue DB output override per SQL config

I have been looking at overriding the output schema and database; ideally we would have 3 or 4 different Glue databases. I have tried the following config, which allows me to override the S3 location but not the Glue database:

{{ config(
    partition_by=['year','month','day','hour'],
    file_format='parquet',
    materialized='table',
    target_schema='bronze',
    target_database='bronze',
    custom_location="s3://" + env_var('AWS_ACCOUNT_ID') + "-bronze/data/general_ledger_posting_event",
    tags=["gl"]
) }}

Ideally I would like to be able to override them at a higher level in dbt_project.yml:
models:
  projectname:
    raw:
      +tags:
        - "raw"
      +materialized: table
    bronze:
      +file_format: hudi
      +tags:
        - "bronze"
      +materialized: table
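
For comparison, the standard dbt mechanism for this is a +schema override per folder (resolved through generate_schema_name), and on some adapters a +database override as well; whether dbt-glue honors these is exactly the open question here:

models:
  projectname:
    bronze:
      +schema: bronze       # custom schema, resolved through generate_schema_name
      +file_format: hudi
      +materialized: table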

Leaving out `unique_key` from HUDI incremental merge strategy gives cryptic errors

Describe the bug

Using materialized='incremental' and incremental_strategy='merge' with no unique_key gives a strange error in dbt-glue when using the HUDI file_format.

model.sql:

{{ config(materialized='incremental', incremental_strategy='merge', file_format='hudi') }}

with source_data as (
    select 1 as id
)

select *
from source_data
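
For comparison, a hedged sketch of the same model with a unique_key set; in the generated code below, leaving it out ends up as the literal string 'None' in hoodie.datasource.write.recordkey.field, which appears to be what triggers the cryptic failure:

{{ config(materialized='incremental', incremental_strategy='merge', file_format='hudi', unique_key='id') }}

with source_data as (
    select 1 as id
)

select *
from source_data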

Steps To Reproduce

Create a simple model with HUDI file format, incremental materialization, merge strategy, and no unique key.

Expected behavior

A clear error should be raised about invalid inputs.

Screenshots and log output

15:37:11  Glue adapter: Glue returned `error` for statement None for code SqlWrapper2.execute('''/* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "govcloud_demo", "target_name": "dev", "node_id": "model.govcloud_demo.my_first_dbt_model"} */
select * from analytics.my_first_dbt_model limit 1 '''), AnalysisException: Table or view not found: analytics.my_first_dbt_model; line 2 pos 14;
15:37:10  Glue adapter: Database Error
  Glue cursor returned `error` for statement None for code

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import *
  spark = SparkSession.builder .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .getOrCreate()
  inputDf = spark.sql("""
  with source_data as (
      select 1 as id
  )

  select *
  from source_data
""")
  outputDf = inputDf.drop("dbt_unique_key").withColumn("update_hudi_ts",current_timestamp())
  if outputDf.count() > 0:
      if None is not None:
          combinedConf = {'className' : 'org.apache.hudi', 'hoodie.datasource.hive_sync.use_jdbc':'false', 'hoodie.datasource.write.precombine.field': 'update_hudi_ts', 'hoodie.consistency.check.enabled': 'true', 'hoodie.datasource.write.recordkey.field': 'None', 'hoodie.table.name': 'my_first_dbt_model', 'hoodie.datasource.hive_sync.database': 'analytics', 'hoodie.datasource.hive_sync.table': 'my_first_dbt_model', 'hoodie.datasource.hive_sync.enable': 'true', 'hoodie.bulkinsert.shuffle.parallelism': 20, 'hoodie.datasource.write.operation': 'bulk_insert'}
          outputDf.write.format('org.apache.hudi').options(**combinedConf).mode('Overwrite').save("s3://.../")
      else:
          combinedConf = {'className' : 'org.apache.hudi', 'hoodie.datasource.hive_sync.use_jdbc':'false', 'hoodie.datasource.write.precombine.field': 'update_hudi_ts', 'hoodie.consistency.check.enabled': 'true', 'hoodie.datasource.write.recordkey.field': 'None', 'hoodie.table.name': 'my_first_dbt_model', 'hoodie.datasource.hive_sync.database': 'analytics', 'hoodie.datasource.hive_sync.table': 'my_first_dbt_model', 'hoodie.datasource.hive_sync.enable': 'true', 'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor', 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator', 'hoodie.index.type': 'GLOBAL_BLOOM', 'hoodie.bloom.index.update.partition.path': 'true', 'hoodie.bulkinsert.shuffle.parallelism': 20, 'hoodie.datasource.write.operation': 'bulk_insert'}
          outputDf.write.format('org.apache.hudi').options(**combinedConf).mode('Overwrite').save("s3://.../")

  spark.sql("""REFRESH TABLE analytics.my_first_dbt_model""")
  SqlWrapper2.execute("""SELECT * FROM analytics.my_first_dbt_model LIMIT 1""")

System information

The output of dbt --version:

Core:
  - installed: 1.3.1
  - latest:    1.3.1 - Up to date!

Plugins:
  - spark: 1.3.0 - Up to date!

The operating system you're using: macOS

The output of python --version: Python 3.10.8


Detect schema changes when working with incremental and parquet when using --full-refresh

Describe the feature

When working with incremental materializations it is often necessary to add new columns. It would be nice to detect schema changes (since the adapter builds a tmp table) and add the new columns to the final table, as dbt does for other adapters. Specifically, this feature is requested for parquet; I guess adding it for delta/hudi is a bit more complicated.

Running the model in --full-refresh mode should rebuild the table and add the new columns, but this does not seem to happen, I guess because the table is not dropped and recreated, so the original columns are kept.
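
For reference, other adapters expose this behaviour through the on_schema_change config on incremental models; a sketch of what that could look like here (whether dbt-glue honors this config is precisely the feature being requested, and the model and column names are illustrative):

{{ config(
    materialized='incremental',
    file_format='parquet',
    on_schema_change='append_new_columns'
) }}

select id, amount, new_metric   -- new_metric added in a later revision of the model
from {{ ref('stg_orders') }}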

Describe alternatives you've considered

When a new column is needed, a workaround is to switch the model to table (from incremental), run the model, then switch it back to incremental.

Additional context

Adding schema change detection will make the usage of incremental models smoother and reduce manual operations or workarounds.

Who will this benefit?

A good example is the following:
an analyst wants to add a new metric to a gold layer. She adds a new aggregate to the model and expects to see the new column in the table (when querying it from Athena), but that column won't be there.

Are you interested in contributing this feature?

Yup!

No output on all models on S3 when dbt threads is greater than 1

Describe the bug

When we configure more than one thread in profiles.yml, dbt runs and reports that everything is OK; nevertheless there is no output on S3 for any of the models.

Steps To Reproduce

Clean the output folders of your dbt project, set threads greater than 1 (e.g. 3, the value dbt recommends when starting a project), run your project, and check the output on S3:

workers: 3
worker_type: G.2X
threads: 3
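
For context, a minimal profile excerpt of the kind described; the profile, role, bucket, and schema names are placeholders:

dbt_glue_demo:
  target: dev
  outputs:
    dev:
      type: glue
      role_arn: arn:aws:iam::123456789012:role/GlueInteractiveSessionRole
      region: us-east-1
      schema: dbt_demo
      location: s3://demo-bucket/dbt-output/
      workers: 3
      worker_type: G.2X
      threads: 3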

Expected behavior

All output files written

System information

The output of dbt --version:

Core:
  - installed: 1.2.1
  - latest:    1.2.1 - Up to date!

Plugins:
  - spark: 1.2.0 - Up to date!

dbt-glue 0.2.6

The operating system you're using:
macOS

The output of python --version:
Python 3.7.13

DBT model not able to run when multiple dbt models together

Describe the bug
We are running dbtvault models on Glue and writing the data in Hudi. We have three files for every table: the first is a view which contains the SQL queries, the second is a dbtvault stage model, which uses the view model as its source and is a table model with parquet format, and the last one is the dbtvault satellite model, an incremental model with file format hudi. Up until now we have been running these three models separately, each in its own Glue interactive session, and everything works as it should. But when we try to run these models together in the same Glue interactive session, I get the error below. The weird thing is that if I change the dbtvault stage model to a view instead of a table, everything works perfectly.

Steps To Reproduce

  • Connect dbtvault with Glue
  • Create three models: view, stage, and satellite
  • Run these dbtvault models simultaneously so that all of them run in the same Glue interactive session
  • Glue version: 3.0
  • dbt version: 1.1.2
  • Spark: 1.1.0
  • file-format: hudi
spark.sql("""REFRESH TABLE dv_raw.sat_chrpf""") SqlWrapper2.execute("""SELECT * FROM dv_raw.sat_chrpf LIMIT 1""") , Py4JJavaError: An error occurred while calling o167.save. : org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:58) at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:648) at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2$adapted(HoodieSparkSqlWriter.scala:647) at scala.collection.mutable.HashSet.foreach(HashSet.scala:77) at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:647) at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:734) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:338) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:183) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133) at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.hive.HiveSyncTool at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:91) at org.apache.hudi.sync.common.util.SyncUtilHelpers.instantiateMetaSyncTool(SyncUtilHelpers.java:75) at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:56) ... 45 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89) ... 47 more Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Got runtime exception when hive syncing at org.apache.hudi.hive.HiveSyncTool.initSyncClient(HiveSyncTool.java:106) at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:95) ... 52 more Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to create HiveMetaStoreClient at org.apache.hudi.hive.HoodieHiveSyncClient.<init>(HoodieHiveSyncClient.java:95) at org.apache.hudi.hive.HiveSyncTool.initSyncClient(HiveSyncTool.java:101) ... 53 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassCastException: class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:239) at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:402) at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:335) at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:315) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:291) at org.apache.hudi.hive.ddl.HiveQueryDDLExecutor.<init>(HiveQueryDDLExecutor.java:62) at org.apache.hudi.hive.HoodieHiveSyncClient.<init>(HoodieHiveSyncClient.java:91) ... 54 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassCastException: class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory) at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3991) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:251) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:234) ... 
60 more Caused by: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassCastException: class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory) at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClientFactory(HiveUtils.java:525) at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClient(HiveUtils.java:506) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3746) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3726) at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3988) ... 62 more

Can't write simple table using Hudi 0.11.0 and Spark 3.1.1 (Glue 3.0)

Describe the bug

I'm trying to write a simple table, i.e. 2 fields, one bigint and the other string. After dbt run I get org.apache.hudi.exception.HoodieException: 'hoodie.table.name' must be set., even though the SQL query includes the table name.

Steps To Reproduce

I just tried to write a simple table using Spark 3.1.1 (Glue 3.0) and Hudi 0.11.0 (I also tried the 0.10.1 AWS and 0.7.0 AWS versions):

{{ config(
    materialized = 'incremental',
    unique_key='id',
    incremental_strategy = 'append',
) }}

{% if not is_incremental() %}

select cast(1 as bigint) as id, 'hello' as msg
union all
select cast(2 as bigint) as id, 'goodbye' as msg

{% else %}

select cast(2 as bigint) as id, 'yo' as msg
union all
select cast(3 as bigint) as id, 'anyway' as msg

{% endif %}

Expected behavior

/* after running the model twice this should create 4 rows in the table
 * | id | msg     |
 * |----|---------|
 * | 1  | hello   |
 * | 2  | goodbye |
 * | 2  | yo      |
 * | 3  | anyway  |
 */

Screenshots and log output

17:41:03.987255 [error] [MainThread]: Database Error in model test_hudi__append (models/test_hudi/test_hudi__append.sql)
17:41:03.987609 [error] [MainThread]:   Glue cursor returned `error` for statement None for code SqlWrapper2.execute('''create table bi_tmp.test_hudi__append
17:41:03.987963 [error] [MainThread]:
17:41:03.988301 [error] [MainThread]:       using hudi
17:41:03.988640 [error] [MainThread]:
17:41:03.988961 [error] [MainThread]:
17:41:03.989302 [error] [MainThread]:       LOCATION 's3://bucket/dbt_glue/data/bi_tmp/test_hudi__append/'
17:41:03.989648 [error] [MainThread]:
17:41:03.990001 [error] [MainThread]:       as
17:41:03.990611 [error] [MainThread]:
17:41:03.991199 [error] [MainThread]:
17:41:03.991647 [error] [MainThread]:   /* after running the model twice this should create 4 rows in the table
17:41:03.992074 [error] [MainThread]:    * | id | msg     |
17:41:03.992496 [error] [MainThread]:    * |----|---------|
17:41:03.992915 [error] [MainThread]:    * | 1  | hello   |
17:41:03.993419 [error] [MainThread]:    * | 2  | goodbye |
17:41:03.993796 [error] [MainThread]:    * | 2  | yo      |
17:41:03.994168 [error] [MainThread]:    * | 3  | anyway  |
17:41:03.994669 [error] [MainThread]:    */
17:41:03.995227 [error] [MainThread]:
17:41:03.995630 [error] [MainThread]:
17:41:03.995983 [error] [MainThread]:
17:41:03.996317 [error] [MainThread]:   select cast(1 as bigint) as id, 'hello' as msg
17:41:03.996639 [error] [MainThread]:   union all
17:41:03.996974 [error] [MainThread]:   select cast(2 as bigint) as id, 'goodbye' as msg
17:41:03.997297 [error] [MainThread]:
17:41:03.997617 [error] [MainThread]:   '''), Py4JJavaError: An error occurred while calling o73.sql.
17:41:03.997939 [error] [MainThread]:   : org.apache.hudi.exception.HoodieException: 'hoodie.table.name' must be set.
17:41:03.998357 [error] [MainThread]:   	at org.apache.hudi.common.config.HoodieConfig.getStringOrThrow(HoodieConfig.java:217)
17:41:03.998743 [error] [MainThread]:   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:96)
17:41:03.999100 [error] [MainThread]:   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:163)
17:41:03.999448 [error] [MainThread]:   	at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:530)
17:41:03.999822 [error] [MainThread]:   	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:220)
17:41:04.000145 [error] [MainThread]:   	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:177)
17:41:04.000473 [error] [MainThread]:   	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
17:41:04.000833 [error] [MainThread]:   	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
17:41:04.001179 [error] [MainThread]:   	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
17:41:04.001508 [error] [MainThread]:   	at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
17:41:04.001891 [error] [MainThread]:   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3724)
17:41:04.002221 [error] [MainThread]:   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
17:41:04.002559 [error] [MainThread]:   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
17:41:04.002886 [error] [MainThread]:   	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
17:41:04.003386 [error] [MainThread]:   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
17:41:04.003824 [error] [MainThread]:   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
17:41:04.004209 [error] [MainThread]:   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
17:41:04.004695 [error] [MainThread]:   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
17:41:04.005129 [error] [MainThread]:   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
17:41:04.005505 [error] [MainThread]:   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
17:41:04.006294 [error] [MainThread]:   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
17:41:04.006685 [error] [MainThread]:   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
17:41:04.007027 [error] [MainThread]:   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3722)
17:41:04.007355 [error] [MainThread]:   	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
17:41:04.007691 [error] [MainThread]:   	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
17:41:04.008052 [error] [MainThread]:   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
17:41:04.008396 [error] [MainThread]:   	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
17:41:04.017237 [error] [MainThread]:   	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:615)
17:41:04.017927 [error] [MainThread]:   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
17:41:04.018338 [error] [MainThread]:   	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
17:41:04.018671 [error] [MainThread]:   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
17:41:04.019077 [error] [MainThread]:   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
17:41:04.019552 [error] [MainThread]:   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
17:41:04.020243 [error] [MainThread]:   	at java.lang.reflect.Method.invoke(Method.java:498)
17:41:04.020623 [error] [MainThread]:   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
17:41:04.021027 [error] [MainThread]:   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
17:41:04.021671 [error] [MainThread]:   	at py4j.Gateway.invoke(Gateway.java:282)
17:41:04.022133 [error] [MainThread]:   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
17:41:04.022708 [error] [MainThread]:   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
17:41:04.023158 [error] [MainThread]:   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
17:41:04.023506 [error] [MainThread]:   	at java.lang.Thread.run(Thread.java:750)
17:41:04.023841 [error] [MainThread]:
17:41:04.024171 [error] [MainThread]:   compiled SQL at target/run/lakehouse_spark/models/test_hudi/test_hudi__append.sql
17:41:04.024541 [info ] [MainThread]:
17:41:04.024925 [info ] [MainThread]: Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
17:41:04.025461 [debug] [MainThread]: Flushing usage events
17:41:04.026044 [debug] [MainThread]: Glue adapter: cleanup called

System information

The output of dbt --version:

Core:
  - installed: 1.1.0
  - latest:    1.1.0 - Up to date!

Plugins:
  - spark: 1.1.0 - Up to date!

The operating system you're using:
macOS 12.4

The output of python --version:
Python 3.7.9

Table information are missing for dbt-glue Iceberg Tables in Glue Data Catalog and Lake Formation console

Describe the bug

When working with Iceberg tables created by the current version of dbt-glue, the table information is not displayed in the UI. The only information filled in is the following advanced properties:

  • TABLE_TYPE : ICEBERG
  • METADATA LOCATION
  • METADATA_PREVIOUS_LOCATION

Steps To Reproduce

  • Create an Iceberg table using dbt-glue
  • Find your table in the AWS Glue Data Catalog (or AWS Lake Formation) console

Screenshots and log output

Screenshot 2023-01-23 at 17.09.12

Screenshot 2023-01-23 at 17.09.46

Glue returned `error` for statement None for code SqlWrapper2.execute()

Describe the bug

I tried to set up a simple project to explore dbt + Glue: read from and write to parquet files. I ran dbt run, but there is an error.

Steps To Reproduce

  • create an S3 bucket with parquet files
  • define a source database/table and run a crawler in the Data Catalog (see the sources sketch after this list)
  • create an IAM role with the permissions described in the documentation
  • create a target S3 bucket where the output files will be saved
  • create a destination database in the Data Catalog
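
A hedged sketch of how such a source could be declared in a sources.yml, using the database and table names that appear in the log below; the model could then reference it via source() rather than hard-coding glue_source_db.reports:

version: 2

sources:
  - name: glue_source_db
    schema: glue_source_db
    tables:
      - name: reports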

Expected behavior

The expected files are created on S3.

Screenshots and log output

14:04:36  Running with dbt=1.2.1
14:04:36  Unable to do partial parsing because profile has changed
14:04:36  Found 1 model, 0 tests, 0 snapshots, 0 analyses, 323 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
14:04:36  
14:05:19  Concurrency: 1 threads (target='dev')
14:05:19  
14:05:19  1 of 1 START table model dbt_que_demo.my_first_dbt_model ....................... [RUN]

Glue adapter: Glue returned `error` for statement None for code SqlWrapper2.execute('''
    create table dbt_que_demo.my_first_dbt_model
    using parquet
    LOCATION 's3://demo-bucket-dai/dbt_que_demo/my_first_dbt_model/'
    as
    select tenant, id, datetime_gmt, report_type
    from glue_source_db.reports
    limit 10
'''), 

Py4JError: An error occurred while calling o91.toString. Trace:
java.lang.IllegalArgumentException: object is not an instance of declaring class
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)

14:05:27  1 of 1 ERROR creating table model dbt_que_demo.my_first_dbt_model .............. [ERROR in 7.57s]

System information

The output of dbt --version:

Core:
  - installed: 1.2.1
  - latest:    1.2.1 - Up to date!

Plugins:
  - spark: 1.2.0 - Up to date!

The operating system you're using:

System Version: macOS 12.4 (21F79)
Kernel Version: Darwin 21.5.0
Chip: Apple M1

The output of python --version:
3.9.14

The output of pip list:
github gist

Additional context

None

Force database parameter must be omitted or have the same value as schema

Describe the feature

Enforce that the database parameter is either omitted or has the same value as schema.
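
A sketch of the constraint being proposed, at the profile level; the field names follow the adapter's existing profile keys, and the exact validation behaviour is what this feature would define:

outputs:
  dev:
    type: glue
    schema: analytics
    # database: analytics   # must either be omitted or match `schema`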

Describe alternatives you've considered

Modify the credentials to use the same approach that dbt-spark uses.

Who will this benefit?

Everyone that is using the adapter.

Are you interested in contributing this feature?

yes

Airflow example

It would be nice to have an example of how to use it integrated with Airflow.
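
A minimal sketch of one common approach, assuming dbt and dbt-glue are installed on the Airflow workers and the project lives at /opt/dbt/my_glue_project; the paths, DAG id, and schedule are illustrative assumptions rather than an official integration:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Run the dbt-glue project on a daily schedule via the dbt CLI.
with DAG(
    dag_id="dbt_glue_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/my_glue_project && dbt run --profiles-dir .",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/my_glue_project && dbt test --profiles-dir .",
    )
    dbt_run >> dbt_test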
