dbt-duckdb's Introduction

dbt-duckdb

DuckDB is an embedded database, similar to SQLite, but designed for OLAP-style analytics. It is crazy fast and allows you to read and write data stored in CSV, JSON, and Parquet files directly, without requiring you to load them into the database first.

dbt is the best way to manage a collection of data transformations written in SQL or Python for analytics and data science. dbt-duckdb is the project that ties DuckDB and dbt together, allowing you to create a Modern Data Stack In A Box or a simple and powerful data lakehouse with Python.

Installation

This project is hosted on PyPI, so you should be able to install it and the necessary dependencies via:

pip3 install dbt-duckdb

The latest supported version targets dbt-core 1.7.x and duckdb version 0.10.x, but we work hard to ensure that newer versions of DuckDB will continue to work with the adapter as they are released. If you would like to use our new (and experimental!) support for persisting the tables that DuckDB creates to the AWS Glue Catalog, you should install dbt-duckdb[glue] in order to get the AWS dependencies as well.
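For example, the optional Glue dependencies can be installed by quoting the extra so your shell does not interpret the brackets:

pip3 install 'dbt-duckdb[glue]'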

Configuring Your Profile

A super-minimal dbt-duckdb profile only needs one setting:

default:
  outputs:
    dev:
      type: duckdb
  target: dev

This will run your dbt-duckdb pipeline against an in-memory DuckDB database that will not be persisted after your run completes. This may not seem very useful at first, but it turns out to be a powerful tool for a) testing out data pipelines, either locally or in CI jobs and b) running data pipelines that operate purely on external CSV, Parquet, or JSON files. More details on how to work with external data files in dbt-duckdb are provided in the docs on reading and writing external files.

To have your dbt pipeline persist relations in a DuckDB file, set the path field in your profile to the path of the DuckDB file that you would like to read and write on your local filesystem. (For in-memory pipelines, the path is automatically set to the special value :memory:).

dbt-duckdb also supports common profile fields like schema and threads, but the database property is special: its value is automatically set to the basename of the file in the path argument with the suffix removed. For example, if the path is /tmp/a/dbfile.duckdb, the database field will be set to dbfile. If you are running in in-memory mode, then the database property will be automatically set to memory.
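As an illustrative sketch, a profile that persists to a local file and sets a few of these common fields might look like the following; the schema name and thread count are arbitrary examples, and the database property would resolve to dbfile:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/a/dbfile.duckdb
      schema: analytics
      threads: 4
  target: dev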

Using MotherDuck

As of dbt-duckdb 1.5.2, you can connect to a DuckDB instance running on MotherDuck by setting your path to use an md: connection string, just as you would with the DuckDB CLI or the Python API.
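A hypothetical MotherDuck profile might look like the sketch below; my_db is a placeholder database name, and authentication (for example, a MotherDuck service token) is handled as described in MotherDuck's documentation:

default:
  outputs:
    dev:
      type: duckdb
      path: "md:my_db"
  target: dev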

MotherDuck databases generally work the same way as local DuckDB databases from the perspective of dbt, but there are a few differences to be aware of:

  1. Currently, MotherDuck requires a specific version of DuckDB, often the latest, as specified in MotherDuck's documentation.
  2. MotherDuck databases do not support transactions, so there is a new disable_transactions profile option that will be automatically enabled if you are connecting to a MotherDuck database in your path.
  3. MotherDuck preloads a set of the most common DuckDB extensions for you, but does not support loading custom extensions or user-defined functions.
  4. A small subset of advanced SQL features are currently unsupported; the only impact of this on the dbt adapter is that the dbt.listagg macro and foreign-key constraints will work against a local DuckDB database, but will not work against a MotherDuck database.

DuckDB Extensions, Settings, and Filesystems

You can load any supported DuckDB extensions by listing them in the extensions field in your profile. You can also set any additional DuckDB configuration options via the settings field, including options that are supported in any loaded extensions. For example, to be able to connect to S3 and read/write Parquet files using an AWS access key and secret, your profile would look something like this:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      extensions:
        - httpfs
        - parquet
      settings:
        s3_region: my-aws-region
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
  target: dev

As of version 1.4.1, we have added (experimental!) support for DuckDB's (also experimental!) filesystem layer implemented via fsspec. The fsspec library provides support for reading and writing files from a variety of cloud data storage systems including S3, GCS, and Azure Blob Storage. You can configure a list of fsspec-compatible implementations for use with your dbt-duckdb project by installing the relevant Python modules and configuring your profile like so:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      filesystems:
        - fs: s3
          anon: false
          key: "{{ env_var('S3_ACCESS_KEY_ID') }}"
          secret: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
          client_kwargs:
            endpoint_url: "http://localhost:4566"
  target: dev

Here, the filesystems property takes a list of configurations, where each entry must have a property named fs that indicates which fsspec protocol to load (so s3, gcs, abfs, etc.) and then an arbitrary set of other key-value pairs that are used to configure the fsspec implementation. You can see a simple example project that illustrates the usage of this feature to connect to a Localstack instance running S3 from dbt-duckdb here.

Fetching credentials from context

Instead of specifying the credentials through the settings block, you can also use the use_credential_provider property. If you set this to aws (currently the only supported implementation) and you have boto3 installed in your Python environment, we will fetch your AWS credentials using the credential provider chain as described here. This means that you can use any supported mechanism from AWS to obtain credentials (e.g., web identity tokens).
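A minimal sketch of such a profile is shown below; it assumes the httpfs and parquet extensions are also needed for S3 access, and the exact placement of the property should be checked against your installed version:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      extensions:
        - httpfs
        - parquet
      use_credential_provider: aws
  target: dev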

Attaching Additional Databases

DuckDB version 0.7.0 added support for attaching additional databases to your dbt-duckdb run so that you can read and write from multiple databases. Additional databases may be configured using dbt run hooks or via the attach argument in your profile that was added in dbt-duckdb 1.4.0:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      attach:
        - path: /tmp/other.duckdb
        - path: ./yet/another.duckdb
          alias: yet_another
        - path: s3://yep/even/this/works.duckdb
          read_only: true
        - path: sqlite.db
          type: sqlite

The attached databases may be referred to in your dbt sources and models by either the basename of the database file minus its suffix (e.g., /tmp/other.duckdb is the other database and s3://yep/even/this/works.duckdb is the works database) or by an alias that you specify (so the ./yet/another.duckdb database in the above configuration is referred to as yet_another instead of another.) Note that these additional databases do not necessarily have to be DuckDB files: DuckDB's storage and catalog engines are pluggable, and DuckDB 0.7.0 ships with support for reading and writing from attached SQLite databases. You can indicate the type of the database you are connecting to via the type argument, which currently supports duckdb and sqlite.
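For example, a model could read from the aliased database above by its fully qualified name; the schema and table names here (main and some_table) are hypothetical:

SELECT *
FROM yet_another.main.some_table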

Configuring dbt-duckdb Plugins

dbt-duckdb has its own plugin system that enables advanced users to extend the adapter with additional functionality, such as the built-in excel, gsheet, iceberg, and sqlalchemy plugins described later in this document.

You can find more details on how to write your own plugins here. To configure a plugin for use in your dbt project, use the plugins property on the profile:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      plugins:
        - module: gsheet
          config:
            method: oauth
        - module: sqlalchemy
          alias: sql
          config:
            connection_url: "{{ env_var('DBT_ENV_SECRET_SQLALCHEMY_URI') }}"
        - module: path.to.custom_udf_module

Every plugin must have a module property that indicates where the Plugin class to load is defined. There is a set of built-in plugins that are defined in dbt.adapters.duckdb.plugins that may be referenced by their base filename (e.g., excel or gsheet), while user-defined plugins (which are described later in this document) should be referred to via their full module path name (e.g. a lib.my.custom module that defines a class named Plugin.)

Each plugin instance has a name for logging and reference purposes that defaults to the name of the module but that may be overridden by the user by setting the alias property in the configuration. Finally, modules may be initialized using an arbitrary set of key-value pairs that are defined in the config dictionary. In this example, we initialize the gsheet plugin with the setting method: oauth and we initialize the sqlalchemy plugin (aliased as "sql") with a connection_url that is set via an environment variable.

Please remember that using plugins may require you to add additional dependencies to the Python environment that your dbt-duckdb pipeline runs in:

  • excel depends on pandas, and openpyxl or xlsxwriter to perform writes
  • gsheet depends on gspread and pandas
  • iceberg depends on pyiceberg and Python >= 3.8
  • sqlalchemy depends on pandas, sqlalchemy, and the driver(s) you need

Experimental Features

Note: Experimental features can change over time, and we would welcome your feedback on their configuration and on other possible use cases.

Using Local Python Modules

In dbt-duckdb 1.6.0, we added a new profile setting named module_paths that allows users to specify a list of paths on the filesystem that contain additional Python modules that should be added to the Python process's sys.path. This allows users to include additional helper Python modules in their dbt projects that can be accessed by the running dbt process and used to define custom dbt-duckdb plugins or library code that is helpful for creating dbt Python models.
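A sketch of such a profile, assuming a hypothetical ./python_modules directory inside your dbt project, might look like this:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      module_paths:
        - ./python_modules
  target: dev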

Reading and Writing External Files

One of DuckDB's most powerful features is its ability to read and write CSV, JSON, and Parquet files directly, without needing to import/export them from the database first.
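For example, DuckDB can select directly from a Parquet file by treating its path as a table name; the file path here is hypothetical:

SELECT *
FROM 'data/events.parquet'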

Reading from external files

You may reference external files in your dbt models either directly or as dbt sources by configuring the external_location in either the meta or the config option on the source definition. The difference is that settings under meta are propagated to the documentation generated via dbt docs generate, while settings under config are not. Put any source settings that should be excluded from the docs under config, and any options that you would like included in the generated documentation under meta.

sources:
  - name: external_source
    meta:
      external_location: "s3://my-bucket/my-sources/{name}.parquet"
    tables:
      - name: source1
      - name: source2

Here, the meta option on external_source defines external_location as a str.format-style template that expresses a pattern for the location of any of the tables defined for that source. So a dbt model like:

SELECT *
FROM {{ source('external_source', 'source1') }}

will be compiled as:

SELECT *
FROM 's3://my-bucket/my-sources/source1.parquet'

If one of the source tables deviates from the pattern or needs some other special handling, then the external_location can also be set on the table itself, for example:

sources:
  - name: external_source
    meta:
      external_location: "s3://my-bucket/my-sources/{name}.parquet"
    tables:
      - name: source1
      - name: source2
        config:
          external_location: "read_parquet(['s3://my-bucket/my-sources/source2a.parquet', 's3://my-bucket/my-sources/source2b.parquet'])"

In this situation, the external_location setting on the source2 table will take precedence, so a dbt model like:

SELECT *
FROM {{ source('external_source', 'source2') }}

will be compiled to the SQL query:

SELECT *
FROM read_parquet(['s3://my-bucket/my-sources/source2a.parquet', 's3://my-bucket/my-sources/source2b.parquet'])

Note that the value of the external_location property does not need to be a path-like string; it can also be a function call, which is helpful in the case that you have an external source that is a CSV file which requires special handling for DuckDB to load it correctly:

sources:
  - name: flights_source
    tables:
      - name: flights
        config:
          external_location: "read_csv('flights.csv', types={'FlightDate': 'DATE'}, names=['FlightDate', 'UniqueCarrier'])"
          formatter: oldstyle

Note that we need to override the default str.format string formatting strategy for this example because the types={'FlightDate': 'DATE'} argument to the read_csv function will be interpreted by str.format as a template to be matched on, which will cause a KeyError: "'FlightDate'" when we attempt to parse the source in a dbt model. The formatter configuration option for the source indicates whether we should use newstyle string formatting (the default), oldstyle string formatting, or template string formatting. You can read up on the strategies the various string formatting techniques use at this Stack Overflow answer and see examples of their use in this dbt-duckdb integration test.

Writing to external files

We support creating dbt models that are backed by external files via the external materialization strategy:

{{ config(materialized='external', location='local/directory/file.parquet') }}
SELECT m.*, s.id IS NOT NULL as has_source_id
FROM {{ ref('upstream_model') }} m
LEFT JOIN {{ source('upstream', 'source') }} s USING (id)

Option        | Default                 | Description
location      | external_location macro | The path to write the external materialization to. See below for more details.
format        | parquet                 | The format of the external file (parquet, csv, or json).
delimiter     | ,                       | For CSV files, the delimiter to use for fields.
options       | None                    | Any other options to pass to DuckDB's COPY operation (e.g., partition_by, codec, etc.)
glue_register | false                   | If true, try to register the file created by this model with the AWS Glue Catalog.
glue_database | default                 | The name of the AWS Glue database to register the model with.

If the location argument is specified, it must be a filename (or S3 bucket/path), and dbt-duckdb will attempt to infer the format argument from the file extension of the location if the format argument is unspecified (this functionality was added in version 1.4.1.)

If the location argument is not specified, then the external file will be named after the model.sql (or model.py) file that defined it with an extension that matches the format argument (parquet, csv, or json). By default, the external files are created relative to the current working directory, but you can change the default directory (or S3 bucket/prefix) by specifying the external_root setting in your DuckDB profile.
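As a sketch, a profile that writes external files under a hypothetical S3 prefix might look like this:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      extensions:
        - httpfs
        - parquet
      external_root: "s3://my-bucket/dbt-output"
  target: dev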

dbt-duckdb supports the delete+insert and append strategies for incremental table models, but unfortunately it does not yet support incremental materialization strategies for external models.
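For example, an incremental model using the delete+insert strategy might be configured as in the sketch below; the events model and the id and event_time columns are hypothetical:

{{ config(materialized='incremental', incremental_strategy='delete+insert', unique_key='id') }}
SELECT *
FROM {{ ref('events') }}
{% if is_incremental() %}
WHERE event_time > (SELECT max(event_time) FROM {{ this }})
{% endif %}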

Re-running external models with an in-memory version of dbt-duckdb

When using :memory: as the DuckDB database, subsequent dbt runs can fail when selecting a subset of models that depend on external tables. This is because external files are only registered as DuckDB views when they are created, not when they are referenced. To overcome this issue we have provided the register_upstream_external_models macro that can be triggered at the beginning of a run. To enable this automatic registration, place the following in your dbt_project.yml file:

on-run-start:
  - "{{ register_upstream_external_models() }}"

Python Support

dbt added support for Python models in version 1.3.0. For most data platforms, dbt will package up the Python code defined in a .py file and ship it off to be executed in whatever Python environment that data platform supports (e.g., Snowpark for Snowflake or Dataproc for BigQuery.) In dbt-duckdb, we execute Python models in the same process that owns the connection to the DuckDB database, which, by default, is the Python process that is created when you run dbt. To execute the Python model, we treat the .py file that your model is defined in as a Python module and load it into the running process using importlib. We then construct the arguments to the model function that you defined (a dbt object that contains the names of any ref and source information your model needs and a DuckDBPyConnection object for you to interact with the underlying DuckDB database), call the model function, and then materialize the returned object as a table in DuckDB.

The value of the dbt.ref and dbt.source functions inside of a Python model will be a DuckDB Relation object that can be easily converted into a Pandas/Polars DataFrame or an Arrow table. The return value of the model function can be any Python object that DuckDB knows how to turn into a table, including a Pandas/Polars DataFrame, a DuckDB Relation, or an Arrow Table, Dataset, RecordBatchReader, or Scanner.
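As a minimal sketch of a Python model, the hypothetical example below converts an upstream relation to a pandas DataFrame, adds a derived column, and returns the result so that dbt-duckdb materializes it as a table; the id column is assumed to exist on the upstream model:

def model(dbt, session):
    upstream = dbt.ref("upstream_model")   # a DuckDB Relation object
    df = upstream.df()                     # convert the relation to a pandas DataFrame
    df["id_is_even"] = df["id"] % 2 == 0   # add a derived column
    return df                              # materialized as a table in DuckDB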

Batch processing with Python models

As of version 1.6.1, it is possible to both read and write data in chunks, which allows for larger-than-memory datasets to be manipulated in Python models. Here is a basic example:

import pyarrow as pa

def batcher(batch_reader: pa.RecordBatchReader):
    for batch in batch_reader:
        df = batch.to_pandas()
        # Do some operations on the DF...
        # ...then yield back a new batch
        yield pa.RecordBatch.from_pandas(df)

def model(dbt, session):
    big_model = dbt.ref("big_model")
    batch_reader = big_model.record_batch(100_000)
    batch_iter = batcher(batch_reader)
    return pa.RecordBatchReader.from_batches(batch_reader.schema, batch_iter)

Writing Your Own Plugins

Defining your own dbt-duckdb plugin is as simple as creating a Python module that defines a class named Plugin that inherits from dbt.adapters.duckdb.plugins.BasePlugin. There are currently four methods that may be implemented in your Plugin class:

  1. initialize: Takes in the config dictionary for the plugin that is defined in the profile to enable any additional configuration for the module based on the project; this method is called once when an instance of the Plugin class is created.
  2. configure_connection: Takes an instance of the DuckDBPyConnection object used to connect to the DuckDB database and may perform any additional configuration of that object that is needed by the plugin, like defining custom user-defined functions.
  3. load: Takes a SourceConfig instance, which encapsulates the configuration for a dbt source, and can optionally return a DataFrame-like object that DuckDB knows how to turn into a table (this is similar to a dbt-duckdb Python model, but without the ability to ref any models or access any information beyond the source config.)
  4. store: Takes a TargetConfig instance, which encapsulates the configuration for an external materialization, and can perform additional operations once the CSV/Parquet/JSON file is written. The glue and sqlalchemy plugins are examples that demonstrate how to use the store operation to register an AWS Glue database table or upload a DataFrame to an external database, respectively.

dbt-duckdb ships with a number of built-in plugins that can be used as examples for implementing your own.
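Below is a minimal, hypothetical sketch of what such a module could look like; the exact method signatures should be checked against the BasePlugin class in your installed version, and the greet UDF is purely illustrative (it assumes a DuckDB version that supports create_function):

from duckdb import DuckDBPyConnection
from dbt.adapters.duckdb.plugins import BasePlugin

def greet(name: str) -> str:
    return "hello, " + name

class Plugin(BasePlugin):
    def initialize(self, config: dict):
        # capture any settings passed via the plugin's config block in the profile
        self._config = config

    def configure_connection(self, conn: DuckDBPyConnection):
        # register a scalar Python UDF; DuckDB infers the types from the annotations
        conn.create_function("greet", greet)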

Roadmap

Things that we would like to add in the near future:

  • Support for Delta and Iceberg external table formats (both as sources and destinations)
  • Make dbt's incremental models and snapshots work with external materializations

dbt-duckdb's People

Contributors

alexandervr, amaralvieira, aranke, b-per, briangold, dbeatty10, dependabot[bot], droher, dwreeves, felippecaso, gmontanola, guenp, hrl20, jcotton1123, jtcohen6, jwills, keiththompson, mehd-io, michelleark, milicevica23, n-clerkx, nintorac, obronco, szarnyasg, thfmn, thootje, tomsej

dbt-duckdb's Issues

dbt fails to run on duckdb==0.4.0

Any dbt command (I personally tried dbt seed and dbt run) fails with the following error:

Runtime Error
  TransactionContext Error: cannot commit - no transaction is active

Downgrading to duckdb v0.3.2 resolves the issue.

Looks like duckdb only works on single-threaded operations

dbt-core version: dbt-core 1.0.8
adapter version: dbt-duckdb 1.1.1

# my config on a single thread that works
jaffle_shop:

  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'jaffle_shop.duckdb'
$ dbt build
02:25:56  Running with dbt=1.0.8
02:25:56  Unable to do partial parsing because profile has changed
02:25:57  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
02:25:57  
02:25:57  Concurrency: 1 threads (target='dev')
02:25:57  
02:25:57  1 of 28 START seed file main.raw_customers...................................... [RUN]
02:25:57  1 of 28 OK loaded seed file main.raw_customers.................................. [INSERT 100 in 0.13s]
02:25:57  2 of 28 START seed file main.raw_orders......................................... [RUN]
02:25:57  2 of 28 OK loaded seed file main.raw_orders..................................... [INSERT 99 in 0.08s]
02:25:57  3 of 28 START seed file main.raw_payments....................................... [RUN]
02:25:57  3 of 28 OK loaded seed file main.raw_payments................................... [INSERT 113 in 0.08s]
02:25:57  4 of 28 START view model main.stg_customers..................................... [RUN]
02:25:57  4 of 28 OK created view model main.stg_customers................................ [OK in 0.14s]
02:25:57  5 of 28 START view model main.stg_orders........................................ [RUN]
02:25:57  5 of 28 OK created view model main.stg_orders................................... [OK in 0.07s]
02:25:57  6 of 28 START view model main.stg_payments...................................... [RUN]
02:25:58  6 of 28 OK created view model main.stg_payments................................. [OK in 0.07s]
02:25:58  7 of 28 START test not_null_stg_customers_customer_id........................... [RUN]
02:25:58  7 of 28 PASS not_null_stg_customers_customer_id................................. [PASS in 0.09s]
02:25:58  8 of 28 START test unique_stg_customers_customer_id............................. [RUN]
02:25:58  8 of 28 PASS unique_stg_customers_customer_id................................... [PASS in 0.07s]
02:25:58  9 of 28 START test accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [RUN]
02:25:58  9 of 28 PASS accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [PASS in 0.07s]
02:25:58  10 of 28 START test not_null_stg_orders_order_id................................ [RUN]
02:25:58  10 of 28 PASS not_null_stg_orders_order_id...................................... [PASS in 0.07s]
02:25:58  11 of 28 START test unique_stg_orders_order_id.................................. [RUN]
02:25:58  11 of 28 PASS unique_stg_orders_order_id........................................ [PASS in 0.07s]
02:25:58  12 of 28 START test accepted_values_stg_payments_payment_method__credit_card__coupon__bank_transfer__gift_card [RUN]
02:25:58  12 of 28 PASS accepted_values_stg_payments_payment_method__credit_card__coupon__bank_transfer__gift_card [PASS in 0.07s]
02:25:58  13 of 28 START test not_null_stg_payments_payment_id............................ [RUN]
02:25:58  13 of 28 PASS not_null_stg_payments_payment_id.................................. [PASS in 0.07s]
02:25:58  14 of 28 START test unique_stg_payments_payment_id.............................. [RUN]
02:25:58  14 of 28 PASS unique_stg_payments_payment_id.................................... [PASS in 0.07s]
02:25:58  15 of 28 START table model main.customers....................................... [RUN]
02:25:58  15 of 28 OK created table model main.customers.................................. [OK in 0.09s]
02:25:58  16 of 28 START table model main.orders.......................................... [RUN]
02:25:58  16 of 28 OK created table model main.orders..................................... [OK in 0.08s]
02:25:58  17 of 28 START test not_null_customers_customer_id.............................. [RUN]
02:25:58  17 of 28 PASS not_null_customers_customer_id.................................... [PASS in 0.07s]
02:25:58  18 of 28 START test unique_customers_customer_id................................ [RUN]
02:25:58  18 of 28 PASS unique_customers_customer_id...................................... [PASS in 0.07s]
02:25:58  19 of 28 START test accepted_values_orders_status__placed__shipped__completed__return_pending__returned [RUN]
02:25:59  19 of 28 PASS accepted_values_orders_status__placed__shipped__completed__return_pending__returned [PASS in 0.07s]
02:25:59  20 of 28 START test not_null_orders_amount...................................... [RUN]
02:25:59  20 of 28 PASS not_null_orders_amount............................................ [PASS in 0.07s]
02:25:59  21 of 28 START test not_null_orders_bank_transfer_amount........................ [RUN]
02:25:59  21 of 28 PASS not_null_orders_bank_transfer_amount.............................. [PASS in 0.07s]
02:25:59  22 of 28 START test not_null_orders_coupon_amount............................... [RUN]
02:25:59  22 of 28 PASS not_null_orders_coupon_amount..................................... [PASS in 0.07s]
02:25:59  23 of 28 START test not_null_orders_credit_card_amount.......................... [RUN]
02:25:59  23 of 28 PASS not_null_orders_credit_card_amount................................ [PASS in 0.07s]
02:25:59  24 of 28 START test not_null_orders_customer_id................................. [RUN]
02:25:59  24 of 28 PASS not_null_orders_customer_id....................................... [PASS in 0.07s]
02:25:59  25 of 28 START test not_null_orders_gift_card_amount............................ [RUN]
02:25:59  25 of 28 PASS not_null_orders_gift_card_amount.................................. [PASS in 0.07s]
02:25:59  26 of 28 START test not_null_orders_order_id.................................... [RUN]
02:25:59  26 of 28 PASS not_null_orders_order_id.......................................... [PASS in 0.07s]
02:25:59  27 of 28 START test relationships_orders_customer_id__customer_id__ref_customers_ [RUN]
02:25:59  27 of 28 PASS relationships_orders_customer_id__customer_id__ref_customers_..... [PASS in 0.07s]
02:25:59  28 of 28 START test unique_orders_order_id...................................... [RUN]
02:25:59  28 of 28 PASS unique_orders_order_id............................................ [PASS in 0.07s]
02:25:59  
02:25:59  Finished running 3 seeds, 3 view models, 20 tests, 2 table models in 2.49s.
02:25:59  
02:25:59  Completed successfully
02:25:59  
02:25:59  Done. PASS=28 WARN=0 ERROR=0 SKIP=0 TOTAL=28

My gut tells me there are weird race conditions going on there across the threads. I haven't dug in deep yet, but wanted to point this out. This command sometimes works if I'm lucky on multiple threads.

# my config on 16 threads that does NOT work
jaffle_shop:

  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'jaffle_shop.duckdb'
      threads: 16
$ dbt build --full-refresh
02:30:25  Running with dbt=1.0.8
02:30:25  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
02:30:25  
02:30:26  Concurrency: 16 threads (target='dev')
02:30:26  
02:30:26  1 of 28 START seed file main.raw_customers...................................... [RUN]
02:30:26  2 of 28 START seed file main.raw_orders......................................... [RUN]
02:30:26  3 of 28 START seed file main.raw_payments....................................... [RUN]
02:30:26  3 of 28 OK loaded seed file main.raw_payments................................... [CREATE 113 in 0.20s]
02:30:26  1 of 28 OK loaded seed file main.raw_customers.................................. [CREATE 100 in 0.25s]
02:30:26  2 of 28 OK loaded seed file main.raw_orders..................................... [CREATE 99 in 0.30s]
02:30:26  4 of 28 START view model main.stg_payments...................................... [RUN]
02:30:26  5 of 28 START view model main.stg_customers..................................... [RUN]
02:30:26  6 of 28 START view model main.stg_orders........................................ [RUN]
02:30:26  5 of 28 ERROR creating view model main.stg_customers............................ [ERROR in 0.17s]
02:30:26  6 of 28 ERROR creating view model main.stg_orders............................... [ERROR in 0.17s]
02:30:26  7 of 28 SKIP test not_null_stg_customers_customer_id............................ [SKIP]
02:30:26  8 of 28 SKIP test unique_stg_customers_customer_id.............................. [SKIP]
02:30:26  9 of 28 SKIP test accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [SKIP]
02:30:26  10 of 28 SKIP test not_null_stg_orders_order_id................................. [SKIP]
02:30:26  11 of 28 SKIP test unique_stg_orders_order_id................................... [SKIP]
02:30:26  4 of 28 OK created view model main.stg_payments................................. [OK in 0.22s]
02:30:26  12 of 28 START test accepted_values_stg_payments_payment_method__credit_card__coupon__bank_transfer__gift_card [RUN]
02:30:26  13 of 28 START test not_null_stg_payments_payment_id............................ [RUN]
02:30:26  14 of 28 START test unique_stg_payments_payment_id.............................. [RUN]
02:30:26  13 of 28 PASS not_null_stg_payments_payment_id.................................. [PASS in 0.16s]
02:30:26  12 of 28 PASS accepted_values_stg_payments_payment_method__credit_card__coupon__bank_transfer__gift_card [PASS in 0.22s]
02:30:26  14 of 28 PASS unique_stg_payments_payment_id.................................... [PASS in 0.27s]
02:30:26  15 of 28 SKIP relation main.customers........................................... [SKIP]
02:30:26  16 of 28 SKIP relation main.orders.............................................. [SKIP]
02:30:26  17 of 28 SKIP test not_null_customers_customer_id............................... [SKIP]
02:30:26  18 of 28 SKIP test unique_customers_customer_id................................. [SKIP]
02:30:26  22 of 28 SKIP test not_null_orders_coupon_amount................................ [SKIP]
02:30:26  19 of 28 SKIP test accepted_values_orders_status__placed__shipped__completed__return_pending__returned [SKIP]
02:30:26  20 of 28 SKIP test not_null_orders_amount....................................... [SKIP]
02:30:26  21 of 28 SKIP test not_null_orders_bank_transfer_amount......................... [SKIP]
02:30:26  23 of 28 SKIP test not_null_orders_credit_card_amount........................... [SKIP]
02:30:26  24 of 28 SKIP test not_null_orders_customer_id.................................. [SKIP]
02:30:26  26 of 28 SKIP test not_null_orders_order_id..................................... [SKIP]
02:30:26  27 of 28 SKIP test relationships_orders_customer_id__customer_id__ref_customers_ [SKIP]
02:30:26  28 of 28 SKIP test unique_orders_order_id....................................... [SKIP]
02:30:26  25 of 28 SKIP test not_null_orders_gift_card_amount............................. [SKIP]
02:30:27  
02:30:27  Finished running 3 seeds, 3 view models, 20 tests, 2 table models in 1.08s.
02:30:27  
02:30:27  Completed with 2 errors and 0 warnings:
02:30:27  
02:30:27  Runtime Error in model stg_customers (models/staging/stg_customers.sql)
02:30:27    Catalog Error: Table with name raw_customers does not exist!
02:30:27    Did you mean "stg_customers"?
02:30:27  
02:30:27  Runtime Error in model stg_orders (models/staging/stg_orders.sql)
02:30:27    Catalog Error: Table with name raw_orders does not exist!
02:30:27    Did you mean "stg_orders"?
02:30:27  
02:30:27  Done. PASS=7 WARN=0 ERROR=2 SKIP=19 TOTAL=28

Ability to conditionally set/nullify `external_location`?

I'm working on a PR where we may want to have the option to either import raw data or load from seed files via the external_location annotation.

Is it possible to return a value in external_location that nullifies the expression?

Something like the pseudocode:

external_location: "{{ if {some value set} then {a real external location} else null }}"

Where in this case null would be interpreted as no instruction.

The alternative is to ask users to comment and uncomment the value, e.g.:

    # Uncomment these lines if you want to use the pregenerated seed files:
    # meta:
    #   external_location: "read_csv_auto('./jaffle-data/{name}.csv', header=1)"

https://github.com/MeltanoLabs/jaffle-shop-template/pull/1/files#diff-6c181d0ccc3c5e37e2feaacdd8872e09073b138835156160dcb5ee7c159f8c63L3-R9

Can you say if null today would cause an error, or just a no-op relative to the external_location operator?

dbt-duckdb should know about Excel

Essentially, any external file format or storage system that can read/write a pandas.DataFrame (or an arrow record reader, or a polars data frame, or a JSON/CSV file) and that has Python library support should be usable as either a) a source or b) an external materialization target in dbt-duckdb.

Reference: dbt-excel.com

Dbt models that are materialized as views result in catalog error (table cannot be found)

Dbt models that are materialized as views are unable to be found:

Catalog Error: Table with name <table_name> does not exist

The sql view model seems to get created without issues:

Concurrency: 1 threads (target='duckdb')
1 of 4 OK created sql view model

Running a select statement on the model returns the missing table error. Wondering if this is a bug/missing feature or if I am missing something in the setup?

I see that there is a duckdb specific create_view_as macro. Does a corresponding view.sql need to be created under the materializations folder?

https://github.com/dbt-labs/dbt-core/blob/dev/louisa-may-alcott/core/dbt/include/global_project/macros/materializations/view/view.sql

upgrade to support dbt-core v1.5.0

Background

The latest version of dbt Core, dbt-core==1.5.0rc1, was published on April 13, 2023 (PyPI | Github).

How to upgrade

dbt-labs/dbt-core#7213 is an open discussion with more detailed information. If you have questions, please put them there!

The above linked guide has more information, but below is a high-level checklist of work that would enable a successful 1.5.0 release of your adapter.

  • Add support for Python 3.11 (if you haven't already)
  • Add support for relevant tests (there are a lot of new ones!)
  • Add support for model contracts
  • Add support for materialized views (this will likely be bumped to 1.6.0)

the next minor release: 1.6.0

FYI, dbt-core==1.6.0 is expected to be released at the end of July, with a release cut at least two weeks prior.

Changing seeds columns fails

If I have seeds that look like this

id,value
1,'hello'
2,'world'

and then change to look like this

id,status
1,'hello'
2,'world'

then I get

00:40:48  Runtime Error in seed so2_status_codes (seeds/status_codes.csv)
00:40:48    Binder Error: Table "so2_status_codes" does not have a column with name "status"

And the table must be manually deleted in order to work around.

Please add a license

I'd like to kick the tires on this project but it's not clear what the license is. Can you add one?

fwiw DuckDB is MIT licensed and dbt-core is Apache. A license compatible with both of these would be ideal.

Models that depend on external models do not automatically register their dependencies as views

Suppose we specify +materialization: external in our dbt_project.yml.

Initially we run all dbt computations using an in-memory db, materializing all the models as parquet files.

Subsequently we change one dbt model A, and want to recompute model A (and possibly downstream dependencies as well).

Now, if A depends on upstream models B and C, trying to run just A will result in a failure, because B and C are not yet registered as views (in-memory db!), despite all tables being specified as external. This forces the user to always recompute upstream dependencies just to trigger the registration logic https://github.com/jwills/dbt-duckdb/blob/master/dbt/include/duckdb/macros/materializations/external.sql#L51-L55

I'd like to propose putting in a check in the materialization macros to

  • loop through the parent nodes
  • for each parent_node, if load_cached_relation(parent_node) is None, call the registration logic.

Does this sound reasonable? Happy to send a PR.

upgrade to support dbt-core v1.3.0

Background

The latest release cut for 1.3.0, dbt-core==1.3.0rc2, was published on October 3, 2022 (PyPI | Github). We are targeting releasing the official cut of 1.3.0 in time for the week of October 16 (in time for Coalesce conference).

We're trying to establish the following precedent w.r.t. minor versions:
Partner adapter maintainers release their adapter's minor version within four weeks of the initial RC being released. Given the delay on our side in notifying you, we'd like to set a target date of November 7 (four weeks from today) for maintainers to release their minor version.

Timeframe   | Date (intended) | Date (Actual) | Event
D - 3 weeks | Sep 21          | Oct 10        | dbt Labs informs maintainers of upcoming minor release
D - 2 weeks | Sep 28          | Sep 28        | core 1.3 RC is released
Day D       | October 12      | Oct 12        | core 1.3 official is published
D + 2 weeks | October 26      | Nov 7         | dbt-adapter 1.3 is published

How to upgrade

dbt-labs/dbt-core#6011 is an open discussion with more detailed information, and dbt-labs/dbt-core#6040 is for keeping track of the community's progress on releasing 1.2.0

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

  • Python Models (if applicable)
  • Incremental Materialization: cleanup and standardization
  • More functional adapter tests to inherit

Catalog Error: Could not rename

Hi,

really impressed with how fast you implemented attach database! Looking forward to start working more with dbt + duckdb 😄

I'm trying it out now by mimicking the Databricks medallion architecture with separate databases for bronze, silver, and gold. The first run always works like a charm, but with subsequent runs I get "Catalog Error: Could not rename". Could it be that there is some kind of restriction on renaming across attached databases?

Changing back to only use the "main" database, the one added without attach, it works every time.

I'll try to force the same error from jaffle shop example or something similar, but probably won't have time until the weekend.

update macro statement error handling to provide helpful messaging to user

Howdy partner,

I ran into an error creating a minimal dbt-duckdb example. I wish I could give more information but I'm a dbt newbie. Below is a repository that (hopefully) recreates this issue w/ a couple of essentially trivial model files:

https://github.com/hamilton/dbt-duckdb-test

dbt run gives this output:

00:08:05  Running with dbt=1.0.1
00:08:05  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
00:08:05  
00:08:05  Concurrency: 1 threads (target='dev')
00:08:05  
00:08:05  1 of 2 START table model main.ingest............................................ [RUN]
00:08:05  1 of 2 ERROR creating table model main.ingest................................... [ERROR in 0.08s]
00:08:05  2 of 2 SKIP relation main.second................................................ [SKIP]
00:08:05  
00:08:05  Finished running 1 table model, 1 view model in 0.34s.
00:08:05  
00:08:05  Completed with 1 error and 0 warnings:
00:08:05  
00:08:05  Compilation Error in macro statement (macros/etc/statement.sql)
00:08:05    cannot unpack non-iterable NoneType object
00:08:05    
00:08:05    > in macro materialization_table_default (macros/materializations/models/table/table.sql)
00:08:05    > called by macro statement (macros/etc/statement.sql)
00:08:05  
00:08:05  Done. PASS=0 WARN=0 ERROR=1 SKIP=1 TOTAL=2

My profiles.yml:

duckdb_test:
  outputs:
    dev:
      path: /Users/hamiltonulmer/Code/tmp/duckdb_test/db.duckdb
      type: duckdb
  target: dev

Bug in register_upstream_external_models

I have an issue when running a dbt run --select statement where external tables are referenced. An invalid upstream location is generated, resulting in the error e.g. IO Error: No files found that match the pattern "external/root/.parquet" when running for the config external_root: "external/root".

I've identified the problem in macros/utils/upstream.sql, and I'll put up a PR shortly. I'm not sure how to write a test that reproduces this issue though, so I'd appreciate it if you could maybe give me some pointers there. Thanks!

Binder Error: Catalog "main" does not exist!

Core:
  - installed: 1.3.2
  - latest:    1.4.1 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - duckdb: 1.3.4 - Up to date!

Running the initial model dbt provides I'm getting this error

23:32:59  Runtime Error in model my_first_dbt_model (models/example/my_first_dbt_model.sql)
23:32:59    Binder Error: Catalog "main" does not exist!

this is my profiles.yml

default:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb

Pip install not working

Super excited about this project. I love duckdb and tying it in with dbt is a great idea. I tried pip installing this package per the README but a version on PyPI couldn't be found. Has it been published to PyPI yet?

Cannot get glue integration to work

I am getting fond of this library as it makes it easy to interface dbt with our existing pipelines which persist data in s3.

I can read/write data to s3 without issues, but I was curious about the glue integration, however I cannot get it to work.
My pip dependencies are the following:

attrs                    22.2.0
Babel                    2.11.0
boto3                    1.26.57
botocore                 1.29.57
certifi                  2022.12.7
cffi                     1.15.1
chardet                  5.1.0
charset-normalizer       3.0.1
click                    8.1.3
colorama                 0.4.5
dbt-core                 1.3.2
dbt-duckdb               1.3.4
dbt-extractor            0.4.1
diff-cover               7.4.0
duckdb                   0.6.1
exceptiongroup           1.1.0
future                   0.18.3
hologram                 0.0.15
idna                     3.4
iniconfig                2.0.0
isodate                  0.6.1
Jinja2                   3.1.2
jmespath                 1.0.1
jsonschema               3.2.0
leather                  0.3.4
Logbook                  1.5.3
MarkupSafe               2.1.2
mashumaro                3.0.4
minimal-snowplow-tracker 0.0.2
msgpack                  1.0.4
mypy-boto3-glue          1.26.53
networkx                 2.8.8
numpy                    1.24.1
packaging                21.3
parsedatetime            2.4
pathspec                 0.9.0
pip                      22.0.2
pluggy                   1.0.0
pycparser                2.21
Pygments                 2.14.0
pyparsing                3.0.9
pyrsistent               0.19.3
pytest                   7.2.1
python-dateutil          2.8.2
python-slugify           7.0.0
pytimeparse              1.1.8
pytz                     2022.7.1
PyYAML                   6.0
regex                    2022.10.31
requests                 2.28.2
s3transfer               0.6.0
setuptools               59.6.0
six                      1.16.0
sqlfluff                 1.2.1
sqlparse                 0.4.3
tblib                    1.7.0
text-unidecode           1.3
toml                     0.10.2
tomli                    2.0.1
tqdm                     4.64.1
typing_extensions        4.4.0
urllib3                  1.26.14
Werkzeug                 2.2.2

I want to try to create a glue table based on some seed data. Similar to the jaffle_shop customers.csv for example.
My model looks as follows:

{{ config(materialized='external', glue_register=true, glue_database='my_glue_db') }}
with source as (
    select * from {{ ref('raw_customers') }}
)

select * from source

In my profiles.yml I did not change anything compared to what I used for writing to s3:

    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      schema: dbt
      threads: 1
      extensions:
        - httpfs
        - parquet
      settings:
        s3_region: 'eu-west-1'
        s3_access_key_id: "{{ env_var('AWS_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('AWS_SECRET_ACCESS_KEY') }}"
        s3_session_token: "{{ env_var('AWS_SESSION_TOKEN') }}"
      external_root: "s3://<somepath>"

The result is that dbt compile finishes successfully, it writes my customers.parquet file to s3 but it does not create the table in glue.
I am not sure how to debug this, in the dbt.logs I did not find anything useful.
Can you point me in the right direction? Am I missing something?

Give a duck some ❤️

Ducks are fantastic animals but need some care, just like code. Here is a repository for scaffolding a dbt adapter. Unfortunately, this awesome project does not comply with that, which makes contribution more challenging than it has to be.

Would it be okay if I brought the structure closer to the template?

dbt models cannot see each other when using the :memory: database.

Describe the bug

dbt models cannot see each other when using the :memory: database.

To Reproduce

Steps to reproduce the behavior:

  1. create two models. table_one.sql & table_two.sql
  2. table_one.sql should be "SELECT 'test' AS value".
  3. table_two.sql should be "SELECT * FROM {{ ref( 'table_one' ) }}"
  4. table_two will fail to build with following error: "Catalog Error: Table with name table_one does not exist"

Expected behavior

dbt build should succeed.

Your environment

Version of dbt-duckdb: ~=1.2.0

External sources: more options & docs

I am interested in using external sources. What I am missing in either features or the documentation (if the feature already exists):

  • How do I set the delimiter for my CSV file?
  • How do I specify a local path?
  • What do the list of tables mean and how can I document that? The CSV/parquet would be a single table, why are there multiple in the docs?

Python model fails with import statement outside of function

Python models aren't working as I expect. This basic example fails:

import pandas as pd

def model(dbt, session):

    return pd.Series(range(10)).to_frame()

15:32:53  Runtime Error in model 03 (days/03.py)
15:32:53    Python model failed:
15:32:53    name 'pd' is not defined

However, simply moving the import statement into the function allows it to work.

def model(dbt, session):
    import pandas as pd
    return pd.Series(range(10)).to_frame()

Versions:

  • dbt-duckdb 1.3.2
  • dbt 1.3.1
  • Python 3.10.7

Exception in WAL playback: Catalog Error: Table does not exist

Hey @jwills

I got the following error when hacking on the demo published by the fal.ai folks and using the dbt-duckdb 1.2.1 multi-threading:

Exception in WAL playback: Catalog Error: Table with name customers__dbt_tmp does not exist!
Did you mean "customers"?

This demo is using a forked and customized beta version of dbt-core + the demo fal adapter + dbt-duckdb multi-threading, so who knows what is actually causing the issues -- may or may not be fixable within dbt-duckdb 🤷

Instructions to reproduce

Grab the code:

git clone https://github.com/dbeatty10/jaffle_shop_with_fal.git
cd jaffle_shop_with_fal
git checkout dbeatty/reproduce-bug

Create a virtual environment:

python3 -m venv env
source env/bin/activate
pip install --upgrade pip

Install the requirements and build:

pip install -r requirements.txt
source env/bin/activate
rm jaffle_shop.duckdb
dbt build

Deleting the jaffle_shop.duckdb file was crucial to reproduce the error for me -- if all the output tables exist already, everything works fine.

Work-around

Editing my requirements.txt to this and re-running the install+build steps worked again:

# Works with 1.2.0
dbt-duckdb==1.2.0

# Doesn't work with 1.2.1
# dbt-duckdb==1.2.1

# Use a local editable version of dbt-duckdb instead:
# -e /Users/dbeatty/projects/dbt-duckdb

Support PyArrow Dataset as a valid return type for Python models

From the docs:

The pyarrow.dataset module provides functionality to efficiently work with tabular, potentially larger than memory, and multi-file datasets.

Since pyarrow.table type won't work with larger than memory datasets, this might be a good alternative. I was also thinking about Polars but wanted to check first if this was easier to implement.

upgrade to support dbt-core v1.4.0

Background

The latest version of dbt Core, dbt-core==1.4.0, was published on January 25, 2023 (PyPI | Github). In fact, a patch, dbt-core==1.4.1 (PyPI | Github), was also released on the same day.

How to upgrade

dbt-labs/dbt-core#6624 is an open discussion with more detailed information. If you have questions, please put them there! dbt-labs/dbt-core#6849 is for keeping track of the community's progress on releasing 1.4.0

The above linked guide has more information, but below is a high-level checklist of work that would enable a successful 1.4.0 release of your adapter.

  • support Python 3.11 (only if your adapter's dependencies allow)
  • Consolidate timestamp functions & macros
  • Replace deprecated exception functions
  • Add support for more tests

the next minor release: 1.5.0

FYI, dbt-core==1.5.0 is expected to be released at the end of April. Please plan on allocating more effort to upgrade support compared to previous minor versions. Expect to hear more in the middle of April.

At a high level, expect much greater adapter test coverage (a very good thing!) and some likely heavy renaming and restructuring as the API-ification of dbt-core is now well underway. See https://github.com/dbt-labs/dbt-core/milestone/82 for more information.

S3 reading error "No files found that match the pattern"

Hi!

I am trying to read a parquet file from S3 by following the documentation, but for some reason I cannot seem to get it to work..

I am using dbt-duckdb==1.3.4.

When trying to read a parquet file from S3 running dbt build I get:

20:28:45    IO Error: No files found that match the pattern "s3://my-bucket/my-project/data.parquet" 
20:28:45  
20:28:45  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=

The s3://my-bucket/my-project/data.parquet is just a dummy string obviously, but I do have a Bucket with a parquet file in there which I replace with the actual string for testing.

What I did:

First I set the S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY environment variables, which are correct and working.

Then I have this in my profiles.yml:

my_dbt_project:
  outputs:
    dev:
      type: duckdb
      path: ":memory:"
    extensions:
      - httpfs
      - parquet
    settings:
      s3_region: eu-west-1
      s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
      s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
  target: dev

And this in sources.yml:

 version: 2

sources:
  - name: external_source
    meta:
      external_location: "s3://my-bucket/my-project/data.parquet"
    tables:
      - name: my_table

and trying to run this

select * from {{ source("external_source", "my_table") }}

But it keeps saying it cannot find the file. I tried to specify the external_root in profiles.yml but no luck..

Any ideas what I am missing?

Glue registration not working "Format is not supported in Glue registrar."

When running a single model aimed to upload a seed csv to a parquet file on S3 and registering it with AWS Glue the following error appears:
"Format is not supported in Glue registrar."

Looking into the code it seems the file format is not passed properly into the "register_glue_table" function in impl.py.

example model:

{{
  config(
    materialized = 'external',
    format = 'parquet',
    location = 's3://example-bucket/example/data.parquet',
    glue_register = true
    )
}}
select 'dbt-duckdb rocks'

with profile:

example:
  outputs:
    dev:
      type: duckdb
      path: ":memory:"
      extensions:
        - httpfs
        - parquet
      settings:
        s3_region: 'eu-central-1'
        s3_access_key_id: "{{ env_var('AWS_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('AWS_SECRET_ACCESS_KEY') }}"
      format: 
  target: dev

I'd love to help out with submitting a fix, only I'm not sure whether this needs to be done in the macro or the Python code.
I'll try to create a PR to fix it

Support aws webidentity token credentials when interacting with s3 on duckdb

When looking through the code, I noticed that duckdb itself does not support the default aws credentials provider chain. Probably because they do not want to have external dependencies.

I am using dbt-duckdb and for the glue integration you are correctly using the credential provider chain if you do not pass the credentials explicitly. I was wondering if you also want to provide this for the s3 integration?

Next to providing the static credentials through settings properties (as exists at the moment), you could use the boto3 client to get and update credentials when they are expired. These credentials can then be used in the DuckDBConnectionWrapper such that they are always valid and any authentication mechanism can be used.

Example usecase for us:
we use aws irsa for our pods on kubernetes and get the following credentials. Basically a roleArn and a webIdentityToken file, which gets updated every hour. Before I can run dbt commands, I now need to convert these aws credentials to the correct format as is accepted by duckdb as follows:

import boto3

session = boto3.Session()
AWS_ACCESS_KEY_ID = session.get_credentials().access_key
AWS_SECRET_ACCESS_KEY = session.get_credentials().secret_key
AWS_SESSION_TOKEN = session.get_credentials().token

Only then can I execute dbt compile ...
Apart from the annoying entrypoint, this has the disadvantage that if my job were to take longer than an hour, the credentials would expire...

I think it would be great if the dbt-duckdb project would do this for me. Since you already depend on boto3, the impact is minimal and in line with what you already implemented for the glue integration. What do you think?

upgrade to support dbt-core v1.3.0

Background

The latest release cut for 1.3.0, dbt-core==1.3.0rc2 was published on October 3, 2022 (PyPI | Github). We are targeting releasing the official cut of 1.3.0 in time for the week of October 16 (in time for Coalesce conference).

We're trying to establish a following precedent w.r.t. minor versions:
Partner adapter maintainers release their adapter's minor version within four weeks of the initial RC being released. Given the delay on our side in notifying you, we'd like to set a target date of November 7 (four weeks from today) for maintainers to release their minor version

Timeframe     Date (intended)  Date (actual)  Event
D - 3 weeks   Sep 21           Oct 10         dbt Labs informs maintainers of upcoming minor release
D - 2 weeks   Sep 28           Sep 28         core 1.3 RC is released
Day D         October 12       Oct 12         core 1.3 official is published
D + 2 weeks   October 26       Nov 7          dbt-adapter 1.3 is published

How to upgrade

dbt-labs/dbt-core#6011 is an open discussion with more detailed information, and dbt-labs/dbt-core#6040 is for keeping track of the community's progress on releasing 1.3.0.

Below is a checklist of work that would enable a successful 1.3.0 release of your adapter.

  • Python Models (if applicable)
  • Incremental Materialization: cleanup and standardization
  • More functional adapter tests to inherit

upgrade to support dbt-core v1.2.0

We've just published the release cut of dbt-core 1.2.0, dbt-core 1.2.0rc1 (PyPI | GitHub release notes).

dbt-labs/dbt-core#5468 is an open discussion with more detailed information, and dbt-labs/dbt-core#5474 is for keeping track of the community's progress on releasing 1.2.0.

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

  • migrate necessary cross-db macros into adapter and ensure they're tested accordingly
  • remove any copy-and-pasted materialization (if your adapter inherits from another adapter)
  • add new basic tests BaseDocsGenerate and BaseDocsGenReferences
  • consider checking and testing support for Python 3.10

dbt-labs/dbt-core#5432 might make it into the second release cut in the next week, in which case you might also want to:

  • implement method and tests for connection retry logic

Temp table issues with duckdb >= 0.2.6

I tried bumping the duckdb version up to 0.2.8 (latest as of right now), and ran into a few issues. A couple are minor and I think easily addressed (I can open a PR if desired), but I hit a deeper snag running the pytest suite.

Whenever the test does a CTAS to create a temp table, a subsequent query doesn't see the temp table and fails.

I bisected the git history of duckdb and found duckdb/duckdb#1485 introduced the change, which was released with 0.2.6. But that change looks correct, and the older behavior was in fact a bug in duckdb that was masking another problem (now exposed).

I believe what's happening is an issue in duckdb's handling of multiple cursors on a single connection, but posting here first for advice and in case anyone else is looking to use dbt + duckdb with a recent version.

In duckdb, each call to conn.cursor() creates a new connection. So if a client does:

import duckdb
conn = duckdb.connect()
conn.cursor().execute('create temp table foo as (select 42)')
conn.cursor().execute('select * from foo')

The select will hit an error, because it is running on a separate connection/session and doesn't see table foo. This incorrectly worked previously because the CTAS was not actually creating a temp table.

This code fragment looks a bit silly here, but in the context of dbt this is effectively what is happening as the generic connection logic opens a new cursor prior to each query (see SQLConnectionManager.add_query in dbt).

Other databases (well, Postgres/psycopg2 at least) will share the connection across these cursors.

One potential solution is to open an issue with duckdb to see about connection-sharing across cursors (it may be the same as duckdb/duckdb#1848, in fact). Another is to override the add_query method in dbt's SQLConnectionManager to cache a single cursor for reuse across queries.

Any suggestions?
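
To illustrate the second option, here is a minimal sketch (plain duckdb, not dbt internals) of the core idea: reuse one cursor for every query so that temp tables stay visible:

import duckdb

conn = duckdb.connect(":memory:")
cursor = conn.cursor()  # cache a single cursor instead of opening one per query
cursor.execute("create temp table foo as (select 42 as answer)")
print(cursor.execute("select * from foo").fetchall())  # [(42,)]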

Database increases by initial size after each dbt run is executed

I have a project running on Windows 10 that contains only views (no tables materialized), all of which select from sources defined with meta external_location entries that use read_parquet. I have found that if I delete the database and then execute successive dbt run commands, the database file grows by approximately its initial size on each run.

The size of the database file for each successive run is:

  1. 29,964 kb
  2. 58,380 kb
  3. 86,796 kb
  4. 115,212 kb

Thanks for all your work on this. I'd love to contribute, but I'm not sure where to start looking.

External materialization file paths do not support jinja references

Consider the following config block:

{{ 
    config(
        materialized='external', 
        location='/path/to/folder/{{ this.name }}.parquet'
) }}

This should write the model name to the folder specified.

However, what happens is that the name is written literally, producing a file called {{ this.name }}.parquet.

I would expect that the external materialization works with the same syntax as this (but it does not):

COPY (SELECT * FROM {{ this }} ) TO '/tmp/data_catalog/conformed/{{ this.name }}.parquet' (FORMAT 'parquet')

Incremental external models

Hi

Do you think it's feasible to add support for incremental materialisation to external locations, i.e. extending the external materialisation to support incremental semantics?

Namely, I was thinking of the simplest case: an "append" strategy implemented by just writing a new Parquet file on each run, with the model as a whole defined by a glob location (see the sketch below).
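
A rough sketch of those semantics expressed directly against DuckDB (the folder and file naming scheme here are invented for illustration):

import os
import time
import duckdb

os.makedirs("/tmp/orders", exist_ok=True)
con = duckdb.connect()

# "append" increment: each run writes a brand-new file into the model's folder
run_file = f"/tmp/orders/run_{int(time.time())}.parquet"
con.execute(f"COPY (SELECT 42 AS order_id) TO '{run_file}' (FORMAT 'parquet')")

# the model's location is a glob, so downstream readers see the union of all runs
print(con.execute("SELECT count(*) FROM read_parquet('/tmp/orders/*.parquet')").fetchone())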

Do you see any drawbacks to such an approach? I'm happy to submit a PR if you point me in the right direction in terms of what API you'd expect this to have: should it be part of the external materialisation, of the incremental one, or a completely new one?

Thanks
Nuno

Out of sync with minor version of dbt-core

I'll open a PR shortly with a proposed fix to the following minor (😆) issue.

TLDR

Installing dbt-duckdb 1.1.1 yields dbt-core 1.0.8 (instead of 1.1.1).

Install commands

python3 -m venv env
source env/bin/activate
python -m pip install dbt-duckdb
source env/bin/activate
dbt --version

Expected output

Core:
  - installed: 1.1.1
  - latest:    1.1.1 - Up to date!

Plugins:
  - duckdb: 1.1.1 - Up to date!

Actual output

installed version: 1.0.8
   latest version: 1.1.1

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - duckdb: 1.1.1 - Up to date!

The fix is simple, and I'll submit a PR.

Create a simple plugin system for writing data to external destinations

The same as in #141, but now for writing data to external destinations.

Motivation

The reasoning is similar to the reading case: sometimes the data has to be presented in a different system. For example, I use JDBC tables in Spark to write data to a SQL Server instance; the SQL Server might be connected to a dashboard tool, which can be preferable to connecting the dashboard directly to the Spark cluster.

The general approach for the plugins is fantastic! Users could implement anything; maybe they want to send an email with a report summarizing the data quality of a data model.
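
To make the idea concrete, here is a purely hypothetical sketch of what a writer plugin hook could look like; the class, method, and config names are invented for illustration and are not an existing dbt-duckdb API:

import duckdb

class SqlServerWriterPlugin:
    # hypothetical plugin: called after a model is built, with the finished relation
    def store(self, con: duckdb.DuckDBPyConnection, relation: str, target_config: dict) -> None:
        rows = con.execute(f"SELECT * FROM {relation}").fetchall()
        # here the plugin would push the rows to SQL Server, send a data-quality email, etc.
        print(f"would write {len(rows)} rows of {relation} to {target_config['jdbc_url']}")

con = duckdb.connect()
con.execute("CREATE TABLE my_model AS SELECT 1 AS id")
SqlServerWriterPlugin().store(con, "my_model", {"jdbc_url": "jdbc:sqlserver://example"})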

dbt models that are parquet files

I haven't thought through this deeply. It might not make sense, it might require changes to dbt, or it might already work? But I wanted to raise it just in case, because it would help me out with something I'm building.

Could we enable configuring dbt-duckdb such that

select
    bla
from {{ ref("orders") }}

compiles to

select
    bla
from 's3://bucket/orders.parquet'

?

I am having issues to get register_upstream_external_models() to work

I stripped down my use case to the following basic example:

  • stg layer: 1 model that outputs intermediary data as an external csv file (stg_orders)
  • marts layer: 1 model that uses the staging layer to do some small calculation (orders)

Instead of running the whole project at once (which works), I wanted to split it into two steps and use register_upstream_external_models() to make sure that this would still work.

When I run only the marts package, it fails because it says it cannot find the table.

09:54:37.537998 [error] [MainThread]: Runtime Error in model orders (models/marts/orders.sql)
09:54:37.539425 [error] [MainThread]:   Catalog Error: Table with name stg_orders does not exist!
09:54:37.540547 [error] [MainThread]:   Did you mean "temp.pg_catalog.pg_views"?
09:54:37.541653 [error] [MainThread]:   LINE 11: select * from "coffee_shop_test"."main"."stg_orders"

When I look in the logs I see the following:

09:54:37.225208 [debug] [MainThread]: Writing runtime sql for node "operation.coffee_shop.coffee_shop-on-run-start-0"
09:54:37.227409 [debug] [MainThread]: Using duckdb connection "master"
09:54:37.228500 [debug] [MainThread]: On master: BEGIN
09:54:37.229899 [debug] [MainThread]: SQL status: OK in 0 seconds
09:54:37.230767 [debug] [MainThread]: Using duckdb connection "master"
09:54:37.231775 [debug] [MainThread]: On master: /* {"app": "dbt", "dbt_version": "1.4.0", "profile_name": "duckdb", "target_name": "dev", "connection_name": "master"} */
create schema if not exists main
09:54:37.233732 [debug] [MainThread]: SQL status: OK in 0 seconds
09:54:37.237245 [debug] [MainThread]: Writing runtime sql for node "operation.coffee_shop.coffee_shop-on-run-start-0"
09:54:37.238110 [debug] [MainThread]: Using duckdb connection "master"
09:54:37.238579 [debug] [MainThread]: On master: /* {"app": "dbt", "dbt_version": "1.4.0", "profile_name": "duckdb", "target_name": "dev", "connection_name": "master"} */
create or replace view "coffee_shop_test"."main"."stg_orders" as (
            select * from 'coffee-data/stg_orders.csv'
          );
09:54:37.240719 [debug] [MainThread]: SQL status: OK in 0 seconds
09:54:37.243547 [debug] [MainThread]: Writing injected SQL for node "operation.coffee_shop.coffee_shop-on-run-start-0"
09:54:37.247490 [info ] [MainThread]: 1 of 1 START hook: coffee_shop.on-run-start.0 .................................. [RUN]
09:54:37.249385 [info ] [MainThread]: 1 of 1 OK hook: coffee_shop.on-run-start.0 ..................................... [OK in 0.00s]
09:54:37.251099 [info ] [MainThread]:
09:54:37.252567 [debug] [MainThread]: On master: ROLLBACK
09:54:37.254094 [debug] [MainThread]: Failed to rollback 'master'
09:54:37.254808 [debug] [MainThread]: On master: Close

09:54:37.259425 [info ] [MainThread]: Concurrency: 1 threads (target='dev')
09:54:37.260726 [info ] [MainThread]:
09:54:37.269843 [debug] [Thread-1 (]: Began running node model.coffee_shop.orders
09:54:37.271039 [info ] [Thread-1 (]: 1 of 1 START sql external model main.orders .................................... [RUN]
09:54:37.273265 [debug] [Thread-1 (]: Acquiring new duckdb connection 'model.coffee_shop.orders'
09:54:37.274562 [debug] [Thread-1 (]: Began compiling node model.coffee_shop.orders
09:54:37.289179 [debug] [Thread-1 (]: Writing injected SQL for node "model.coffee_shop.orders"
09:54:37.290757 [debug] [Thread-1 (]: Timing info for model.coffee_shop.orders (compile): 2023-02-24 09:54:37.275617 => 2023-02-24 09:54:37.290464
09:54:37.291590 [debug] [Thread-1 (]: Began executing node model.coffee_shop.orders
09:54:37.392108 [debug] [Thread-1 (]: Using duckdb connection "model.coffee_shop.orders"
09:54:37.393062 [debug] [Thread-1 (]: On model.coffee_shop.orders: BEGIN
09:54:37.393669 [debug] [Thread-1 (]: Opening a new connection, currently in state init
09:54:37.403457 [debug] [Thread-1 (]: SQL status: OK in 0 seconds
09:54:37.404068 [debug] [Thread-1 (]: Using duckdb connection "model.coffee_shop.orders"
09:54:37.404537 [debug] [Thread-1 (]: On model.coffee_shop.orders: /* {"app": "dbt", "dbt_version": "1.4.0", "profile_name": "duckdb", "target_name": "dev", "node_id": "model.coffee_shop.orders"} */

    create table
      "coffee_shop_test"."main"."orders__dbt_tmp"
    as (
      select * from "coffee_shop_test"."main"."stg_orders"
    );

09:54:37.440405 [debug] [Thread-1 (]: Timing info for model.coffee_shop.orders (execute): 2023-02-24 09:54:37.292091 => 2023-02-24 09:54:37.440173
09:54:37.440917 [debug] [Thread-1 (]: On model.coffee_shop.orders: ROLLBACK
09:54:37.443434 [debug] [Thread-1 (]: Failed to rollback 'model.coffee_shop.orders'
09:54:37.443952 [debug] [Thread-1 (]: On model.coffee_shop.orders: Close
09:54:37.448232 [debug] [Thread-1 (]: Runtime Error in model orders (models/marts/orders.sql)
  Catalog Error: Table with name stg_orders does not exist!
  Did you mean "temp.pg_catalog.pg_views"?
  LINE 11: select * from "coffee_shop_test"."main"."stg_orders"

The generated SQL seems fine, but for some reason it rolls back, which is probably why the subsequent dbt model does not find the stg_orders view. It is strange that I do not see the actual error that caused the rollback. When I execute the create view statement manually in DuckDB, it works correctly.

stg_orders.sql:

{{
    config(
        materialized = 'external',
        location='coffee-data/stg_orders.csv',
        format='csv'
    )
}}

select range from range(3)

orders.sql:

{{
    config(
        materialized = 'external',
        location='coffee-data/orders.csv',
        format='csv'
    )
}}

select * from {{ ref('stg_orders') }}

If I change the reference in orders.sql to a source reference it works, of course, but that is not what I want.
Am I missing something, or is there something wrong with the register_upstream_external_models function?
Is the actual error swallowed somehow? Any idea how I could debug this further?

I am using dbt-duckdb 1.4.0 as well as dbt 1.4.0 and duckdb 0.6.1
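
One way to dig further might be to open the database file directly (the path below is an assumption based on the database name in the logs) and check whether the hook's view actually persisted:

import duckdb

con = duckdb.connect("coffee_shop_test.duckdb")  # assumed path of the profile's database file
print(con.execute(
    "select table_schema, table_name, table_type from information_schema.tables"
).fetchall())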

Parquet materialization

👋

Hey!

Something I've been thinking about is using parquet as a materialization. To be specific, one where dbt-duckdb would use the underlying duckdb connection as a shim to edit parquet rather than adding some tables into the .duckdb file.

I'm not sure if it's possible to override what ref does in an adapter, but this would roughly need to do two things:

  1. Make a new materialization that takes the query for a model and executes
    copy (select ...) to 'model_name.parquet' (format parquet) rather than create table model_name as (select ...) for parquet-materialized models (see the sketch after this list).
  2. Update ref to use read_parquet(<ref'd model name>) rather than schema.table_name if the model being referenced was materialized as parquet.
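
As a rough illustration of what those two steps would amount to at the DuckDB level (table names and paths are invented):

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE upstream AS SELECT 7 AS n")

# step 1: the materialization emits COPY ... TO instead of CREATE TABLE ... AS
con.execute("COPY (SELECT n * 2 AS n2 FROM upstream) TO '/tmp/model_name.parquet' (FORMAT 'parquet')")

# step 2: ref() to a parquet-materialized model resolves to read_parquet(...) instead of schema.table
print(con.execute("SELECT * FROM read_parquet('/tmp/model_name.parquet')").fetchall())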

Would love to hear what other people think / if this would be useful!

This is loosely related to: #15

Python models

It would be great to have support for dbt Python models. My use case is primarily reading/writing Delta tables (I hope this will become obsolete with the duckdb-deltatable-extension), but there are many other possibilities.

I have already created a prototype, but it is an ugly duckling that supports only a single thread (which is OK-ish for my use case). I do not expect the PR to be merged anytime soon.

P.S. I have created this issue because it is always nice to have both a PR and an issue :)

External materializations memory alloc issue

I'm not exactly sure what is going on here; the error coming back from dbt is not super clear. It looks like DuckDB is running out of memory.

The error is:

/usr/local/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

More context
Before external materializations, I was running my dbt project and then copying the data out to specific folders with an on-run-end macro. This allowed me to run somewhere in the neighborhood of 100,000 simulations in my project (found over here) on a VM with 8GB of RAM without seeing this issue. The default number of simulations in the project is typically 10k to keep it fast.

However, when using external materializations, I have to reduce the run size to 1k in order for it to run successfully. This leads me to believe that there is a memory “leak” inside of DuckDB. If I were to speculate, DuckDB is holding both the external tables and the duckdb tables in memory, instead of dropping the duckdb tables once the file has been exported to its external storage location.

Handle column names with spaces

I know it's bad practice, but you can't always help it; that is how you get your source data :)

If column names have spaces and you define tests for them, then you get errors.

E.g.

          - name: "van tijdstip"
            type: time
            description: Start time of the quarter
            tests:
              - not_null
20:24:34    Parser Error: syntax error at or near "tijdstip"
20:24:34    LINE 15: where van tijdstip is null

The generated test SQL probably needs to surround all column names with double quotes.
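
A quick illustration in plain DuckDB of why quoting fixes the parser error (the table name is made up; the column comes from the example above):

import duckdb

con = duckdb.connect()
con.execute('CREATE TABLE quarters ("van tijdstip" TIME)')
# unquoted, the generated test SQL trips the parser:  where van tijdstip is null
# quoting the identifier is all the generated test needs:
print(con.execute('SELECT count(*) FROM quarters WHERE "van tijdstip" IS NULL').fetchone())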

The database file was created with an newer version of DuckDB.

Thanks for developing both dbt-duckdb and target-duckdb!

Being new to the whole Meltano, dbt, DuckDB stack, I tried to modify the official Meltano tutorial by replacing Postgres with DuckDB:

cat meltano.yml 
version: 1
default_environment: dev
project_id: b1cb5b90-6956-4199-b7ed-e10092dfb5da
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-github
    variant: singer-io
    pip_url: tap-github
    config:
      repository: palaimon/ipyannotator palaimon/fastfm2
      start_date: ' 2016-11-28'
    select:
    - commits.url
    - commits.sha
    - commits.*
  loaders:
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
  - name: target-duckdb
    variant: jwills
    pip_url: target-duckdb~=0.4
    config:
      add_metadata_columns: true
      filepath: dump.duckdb
      default_target_schema: tap_github
  transformers:
  - name: dbt-duckdb
    variant: jwills
    pip_url: dbt-core~=1.2.0 dbt-duckdb~=1.2.0
    config:
      path: <my local path>/dump.duckdb

and run into the following error message.

dbt.exceptions.RuntimeException: Runtime Error
  IO Error: Trying to read a database file with version number 39, but we can only read version 38.
  The database file was created with an newer version of DuckDB.
  
  The storage of DuckDB is not yet stable; newer versions of DuckDB cannot read old database files and vice versa.
  The storage will be stabilized when version 1.0 releases.
  
  For now, we recommend that you load the database file in a supported version of DuckDB, and use the EXPORT DATABASE command followed by IMPORT DATABASE on the current version of DuckDB.

The error ("Trying to read a database file with version number 39, but we can only read version 38.") seems pretty clear, but I wasn't able to find a way to pin the dbt-duckdb and target-duckdb versions so that they are compatible.

A pointer on how to make dbt-duckdb and target-duckdb work together in meltano would be great.
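
One possible workaround, following the error message's own suggestion, is an EXPORT DATABASE / IMPORT DATABASE round trip. A rough sketch (run each half under the corresponding DuckDB version; the file paths are assumptions based on the meltano.yml above):

import duckdb

# under the newer DuckDB that target-duckdb used to write dump.duckdb:
con = duckdb.connect("dump.duckdb")
con.execute("EXPORT DATABASE 'dump_export' (FORMAT PARQUET)")
con.close()

# under the older DuckDB that your dbt-duckdb environment resolves to:
con = duckdb.connect("dump_for_dbt.duckdb")
con.execute("IMPORT DATABASE 'dump_export'")
con.close()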

External Location: f-string value needs more than one variable

I am interested in using an external_location f-string with several variables, but only name, identifier and schema are available in the f-string.

sources:
  - name: external_source
    meta:
      external_location: "s3://my-bucket/{folder2}/{meta.name}.{meta.suffix}"
    tables:
      - name: source1
        meta:
          suffix: parquet
          folder: folder1
      - name: source2
        meta:
          suffix: csv
          folder: folder2

Additionally, I don't want to override external_location for each table name. In this regard, I suggest improving the f-string to use real format-style field names, as below:

sources:
  - name: external_source
    meta:
      external_location: "s3://my-bucket/{source.meta['folder2']}/{source.meta['name']}.{source.meta['suffix']}"
    tables:
      - name: source1
        meta:
          suffix: parquet
          folder: folder1
      - name: source2
        meta:
          suffix: csv
          folder: folder2

In this case, the identifier would be addressed as source.identifier. I can do this myself, but please let me know if you have any suggestions for improving it. I have identified the relevant code in duckdb/relation.py and sent you a PR, but since I'm not sure you agree with the change, I have not modified the README yet. The tests should also be improved in this regard, which I can do after hearing your suggestions.
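
For what it's worth, here is a tiny sketch of how such a template could be expanded if the whole source object were exposed to str.format() (this is an assumption about the proposal, not current behavior; the Source class is a stand-in):

# stand-in for the dbt source node, just to show the template expansion
class Source:
    name = "source1"
    schema = "external_source"
    meta = {"suffix": "parquet", "folder": "folder1"}

template = "s3://my-bucket/{source.meta[folder]}/{source.name}.{source.meta[suffix]}"
print(template.format(source=Source()))  # s3://my-bucket/folder1/source1.parquet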

add adapter to dbt docs's "Available Adapters" page

howdy @jwills

The Available Adapters page is one of the dbt community's most-visited docs pages. It would be of great benefit for first-time visitors to the dbt docs to see:

  1. that this adapter is a possible option for using dbt-core, and
  2. how large the dbt ecosystem of supported databases is.

dbt-labs/docs.getdbt.com#1489 exists to address this with all as-of-yet undocumented adapters.

We just released Documenting a new adapter, a new guide on how to add an adapter to the Available Adapters page. I'd love to see this adapter on that page, so feel free to reach out with any questions/blockers by either replying to this issue, or posting in the #adapter-ecosystem channel of the dbt Community Slack.

Looking forward to the contribution!
