Comments (2)
@hello-world-bfree thanks for the bug report. I think it is (more or less) clear what is going on. If your data source adds columns on the fly and the in-memory buffer for extracted data does not hold many rows (5000 by default), then we will indeed write several parquet files with different schemas. You could increase the buffer size to 1,000,000 rows and see if that still happens: https://dlthub.com/docs/reference/performance#controlling-in-memory-buffers
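To illustrate the effect described above, here is a minimal stdlib-only simulation. It is not dlt's writer: `flush_schemas` and the sample rows are hypothetical, and each "file" is reduced to the tuple of column names seen in one buffer. The parameter name `buffer_max_items` mirrors the setting I believe the linked performance docs describe under `[data_writer]`; check the docs for the exact spelling.

```python
from typing import Dict, Iterable, List, Tuple


def flush_schemas(
    rows: Iterable[Dict[str, object]], buffer_max_items: int
) -> List[Tuple[str, ...]]:
    """Simulate a row buffer that writes one file each time it fills up.

    Returns the column names of every simulated file, in write order.
    Hypothetical helper for illustration only, not dlt code.
    """
    schemas: List[Tuple[str, ...]] = []
    buffer: List[Dict[str, object]] = []

    def flush() -> None:
        if not buffer:
            return
        # The schema of a file is the union of columns seen in this buffer only.
        columns: List[str] = []
        for row in buffer:
            for name in row:
                if name not in columns:
                    columns.append(name)
        schemas.append(tuple(columns))
        buffer.clear()

    for row in rows:
        buffer.append(row)
        if len(buffer) >= buffer_max_items:
            flush()
    flush()  # write whatever is left at the end
    return schemas


# The source starts emitting an "email" column from the third row on.
rows = [
    {"id": 1},
    {"id": 2},
    {"id": 3, "email": "a@example.com"},
    {"id": 4, "email": "b@example.com"},
]

small_buffer = flush_schemas(rows, buffer_max_items=2)
large_buffer = flush_schemas(rows, buffer_max_items=1000)
print(small_buffer)  # two files with different column sets
print(large_buffer)  # one file containing every column
```

With the small buffer, the first file has only `id` while the second has `id` and `email`. With a buffer large enough to hold all rows, a single file with both columns is written, which is why raising the buffer size is a useful check.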
@steinitzu let's try to fix it. Since all parquet files are loaded from local storage, we are able to get the column names and generate a COPY command per file:
with maybe_context(lock):
    with sql_client.begin_transaction():
        sql_client.execute_sql(
            f"COPY {qualified_table_name} FROM '{file_path}' ( FORMAT"
            f" {source_format} {options});"
        )
Right now we assume that duckdb will handle this itself, and apparently that is not the case.
The above will work because dlt makes sure that all changes are added and the schema is already migrated, so all possible columns are present.
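A minimal sketch of what a per-file COPY with an explicit column list could look like. `build_copy_sql` and `quote_ident` are hypothetical helpers, not part of dlt, and the table name and path are made up. The column names are passed in as a plain list here; in practice they could be read from each file, for example with `pyarrow.parquet.read_schema(file_path).names`. This assumes DuckDB's `COPY table (columns) FROM ...` form.

```python
from typing import Sequence


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_copy_sql(
    qualified_table_name: str,
    file_path: str,
    columns: Sequence[str],
    source_format: str = "PARQUET",
    options: str = "",
) -> str:
    """Build a COPY statement that names the columns present in one file.

    Hypothetical helper: `columns` must list the file's columns in file order.
    """
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(quote_ident(c) for c in columns)
    escaped_path = file_path.replace("'", "''")  # escape quotes in the path literal
    suffix = f" {options}" if options else ""
    return (
        f"COPY {qualified_table_name} ({column_list}) "
        f"FROM '{escaped_path}' (FORMAT {source_format}{suffix});"
    )


# Two files for the same table, written before and after a column appeared.
first = build_copy_sql('"my_dataset"."events"', "/tmp/load/events.1.parquet", ["id"])
second = build_copy_sql(
    '"my_dataset"."events"', "/tmp/load/events.2.parquet", ["id", "email"]
)
print(first)
print(second)
```

Because the destination table has already been migrated to contain every column, each statement only needs to name the subset that its file actually holds.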
from dlt.
@hello-world-bfree I'm pretty sure this was fixed in dlt 0.4.8.
Can you try updating? The latest version is 0.4.12.
I was only able to replicate the bug on 0.4.7.
Parquet files are now normalized: missing columns are added and columns are reordered as needed.
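As a rough illustration of that kind of normalization, here is a stdlib-only sketch that works on a column-oriented dict instead of a real parquet file. `normalize_columns` is a hypothetical name and this is not dlt's implementation; it only shows the idea of filling missing columns with nulls and reordering to match the destination schema.

```python
from typing import Dict, List, Optional, Sequence


def normalize_columns(
    table: Dict[str, List[object]], target_columns: Sequence[str]
) -> Dict[str, List[Optional[object]]]:
    """Return `table` with exactly `target_columns`, in that order.

    Columns missing from `table` are filled with None for every row.
    Hypothetical helper for illustration only.
    """
    unknown = [name for name in table if name not in target_columns]
    if unknown:
        raise ValueError(f"columns not in target schema: {unknown}")
    num_rows = len(next(iter(table.values()))) if table else 0
    return {
        name: list(table[name]) if name in table else [None] * num_rows
        for name in target_columns
    }


# A file written before "created_at" existed, with columns in a different order.
file_data = {"email": ["a@example.com", "b@example.com"], "id": [3, 4]}
normalized = normalize_columns(file_data, ["id", "email", "created_at"])
print(list(normalized))          # column order now follows the target schema
print(normalized["created_at"])  # missing column filled with None
```

Once every file has the same columns in the same order, a plain COPY without a column list can load them all.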