dlt version: 0.4.7


Nullable fields cause `column mismatch` error (closed)

hello-world-bfree commented on September 22, 2024

Nullable fields cause `column mismatch` error


Comments (2)

rudolfix commented on September 22, 2024

@hello-world-bfree thanks for the bug report. I think it is (more or less) clear what is going on. If your data source adds columns on the fly and the in-memory buffer for extracted data does not hold many rows (5000 by default), then we will indeed write several parquet files with different schemas. You could increase the buffer size to 1,000,000 rows and see if that still happens: https://dlthub.com/docs/reference/performance#controlling-in-memory-buffers
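To make the mechanism concrete, here is a minimal sketch of how a size-limited buffer can end up writing files with different column sets. It is only an illustration in plain Python, not dlt's actual writer: `flush_schemas`, `buffer_size`, and the sample rows are hypothetical names chosen for this example.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def flush_schemas(rows: Iterable[Row], buffer_size: int) -> List[Tuple[str, ...]]:
    """Return the column tuple of each file a size-limited buffer would write.

    Hypothetical helper: it only models the buffering behaviour described above.
    """
    schemas: List[Tuple[str, ...]] = []
    buffer: List[Row] = []

    def flush() -> None:
        # A file only knows about the columns seen in its own buffer.
        columns: List[str] = []
        for row in buffer:
            for name in row:
                if name not in columns:
                    columns.append(name)
        schemas.append(tuple(columns))
        buffer.clear()

    for row in rows:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            flush()
    if buffer:
        flush()
    return schemas


# The "email" column only shows up from the third row on.
rows = [{"id": 1}, {"id": 2}, {"id": 3, "email": None}, {"id": 4, "email": "a@b.c"}]
small = flush_schemas(rows, buffer_size=2)     # two files, two different schemas
large = flush_schemas(rows, buffer_size=1000)  # one file, one schema
```

With the small buffer the first file has no `email` column at all, which is the situation that triggers the mismatch; with a buffer large enough to hold every row, a single file carries all columns.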

@steinitzu let's try to fix it. Since all parquet files are loaded from local storage, we are able to get the column names and generate a COPY command per file.

    with maybe_context(lock):
        with sql_client.begin_transaction():
            sql_client.execute_sql(
                f"COPY {qualified_table_name} FROM '{file_path}' ( FORMAT"
                f" {source_format} {options});"
            )

Right now we assume that duckdb will handle this itself, and that is apparently not the case.

A per-file COPY with explicit column names will work because dlt makes sure that all changes are added and the schema is already migrated, so all possible columns are present.
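A rough sketch of what generating a COPY command per file could look like, assuming the column names of each parquet file have already been read. `build_copy_sql` is a hypothetical helper, not part of dlt, and it is an assumption here that DuckDB maps the file's columns onto the listed table columns as intended; the `options` part of the original statement is left out for brevity.

```python
from typing import Sequence


def build_copy_sql(
    qualified_table_name: str,
    file_path: str,
    columns: Sequence[str],
    source_format: str = "PARQUET",
) -> str:
    """Build a COPY statement that names the target columns explicitly.

    Hypothetical helper for illustration; `columns` must be the columns
    actually present in the file, in file order.
    """
    # Quote identifiers and escape embedded double quotes.
    column_list = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
    return (
        f"COPY {qualified_table_name} ({column_list}) "
        f"FROM '{file_path}' (FORMAT {source_format});"
    )


# One statement per file, each listing only the columns that file contains.
sql_first = build_copy_sql("main.users", "/tmp/part1.parquet", ["id"])
sql_second = build_copy_sql("main.users", "/tmp/part2.parquet", ["id", "email"])
```

Columns missing from a given file are simply not listed, so they would be left to their default (NULL for nullable columns) in the already migrated table.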


steinitzu commented on September 22, 2024

@hello-world-bfree I'm pretty sure this was fixed in dlt 0.4.8. Can you try updating? The latest version is 0.4.12.
I was only able to replicate the bug on 0.4.7.
Parquet files are now normalized: missing columns are added and columns are re-ordered as needed.
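The idea behind that normalization can be sketched on plain rows. This is not the code dlt uses on parquet files, just an illustration of "add missing columns, re-order the rest"; `normalize_rows` and `schema_columns` are hypothetical names.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[object]]


def normalize_rows(rows: Sequence[Row], schema_columns: Sequence[str]) -> List[Row]:
    """Align every row to the schema: fill missing columns with None, fix the order.

    Hypothetical helper; columns not listed in `schema_columns` are dropped.
    """
    return [{name: row.get(name) for name in schema_columns} for row in rows]


normalized = normalize_rows(
    [{"email": "a@b.c", "id": 1}, {"id": 2}],
    schema_columns=["id", "email"],
)
```

After this step every batch has the same columns in the same order, so a plain positional load no longer depends on which columns happened to be present when a file was written.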

