Comments (4)
Thanks for the detailed report; however, this is currently an expected limitation of our upsert implementation.
Our upsert handles constraint violations between existing data and new data. Your constraint violations appear to be between rows within the new data itself (i.e. you have duplicate rows in your new data).
I suggest using DISTINCT ON to filter out the duplicates in your new data.
https://duckdb.org/docs/sql/query_syntax/select#distinct-on-clause
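As a sketch of the suggestion above (the table and column names are placeholders, not from the original report), deduplicating the new data with DISTINCT ON before the upsert might look like this:

```sql
CREATE TABLE tbl (i INT PRIMARY KEY, j INT);
INSERT INTO tbl VALUES (1, 42);

-- Without deduplication this upsert would error, because (1, 84) and
-- (1, 99) conflict with each other *within* the new data. DISTINCT ON (a)
-- keeps one row per key, so only existing-vs-new conflicts remain,
-- which ON CONFLICT can handle.
INSERT INTO tbl
SELECT DISTINCT ON (a) a, b
FROM (VALUES (1, 84), (1, 99), (3, 21)) AS new_data(a, b)
ON CONFLICT (i) DO UPDATE SET j = excluded.j;
```

Which of the duplicate rows survives is unspecified unless an ORDER BY is added, per the DISTINCT ON documentation linked above.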
Should the docs mention that the data passed to an upsert must contain no duplicate rows (with respect to the target table's constraints), and that DISTINCT ON is the way to deduplicate?
That's a good idea; it would help combat this common pitfall until we fix the limitation 👍
Again, I will try to find the previous issue that touched on some of the same themes, but here is a workaround (for my case at least). It is perhaps a bit convoluted, but it works for my use case: reading a large number of files into a table. I expect duplicates across the files (duplicates across all columns, not just the keys), and the process may need to be restarted without my being able to control which files have already been inserted, so my program might try to insert rows that are already in the table. The files are small enough to fit in memory. The following accommodates all of that.
D CREATE TABLE tbl (i INT PRIMARY KEY, j INT);
D INSERT INTO tbl VALUES (1, 42);
D INSERT INTO tbl (SELECT DISTINCT * FROM (VALUES (1, 42), (1, 84), (3, 21)) AS tmp(a, b) ANTI JOIN tbl ON tbl.i = tmp.a);
D SELECT * FROM tbl;
┌───────┬───────┐
│ i │ j │
│ int32 │ int32 │
├───────┼───────┤
│ 1 │ 42 │
│ 3 │ 21 │
└───────┴───────┘
The anti join ensures the rows have not already been inserted into the table, while DISTINCT ensures the new data contains no duplicate rows. Note that this workaround only works if duplicate rows are exactly identical, and if the table has no NOT NULL constraints.
If the rows are not exactly the same (i.e. the column(s) that make up the key of the target table are equal, but other columns differ), you have to choose which rows are inserted and which are thrown away.
I assume the behavior of INSERT OR IGNORE would be to insert the first of the rows sharing duplicate key columns, while those that follow are ignored/not inserted due to violating the constraint.
I think the following code mimics that behavior:
--Table that is to be inserted into
CREATE TABLE tbl (i INT PRIMARY KEY, j INT);
INSERT INTO tbl VALUES (1, 42);
--Table (or file) that is inserted from
CREATE TABLE tbl2 (i INT, j INT);
INSERT INTO tbl2 VALUES (1,8),(2,7),(2,9),(2,10),(1,3);
SELECT * FROM tbl2;
┌───────┬───────┐
│ i │ j │
│ int32 │ int32 │
├───────┼───────┤
│ 1 │ 8 │
│ 2 │ 7 │
│ 2 │ 9 │
│ 2 │ 10 │
│ 1 │ 3 │
└───────┴───────┘
First, let's look at how rows from the insert data with the same values in the primary key column(s), here column i, are handled:
SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY i) AS row_number FROM tbl2) t WHERE t.row_number = 1;
┌───────┬───────┬────────────┐
│ i │ j │ row_number │
│ int32 │ int32 │ int64 │
├───────┼───────┼────────────┤
│ 2 │ 7 │ 1 │
│ 1 │ 8 │ 1 │
└───────┴───────┴────────────┘
Now combine the two methods: deduplicating on the key columns, and not inserting rows with the same primary key value as existing rows:
INSERT INTO tbl SELECT t.i, t.j FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY i) AS row_number FROM tbl2) t ANTI JOIN tbl ON tbl.i = t.i WHERE t.row_number = 1;
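One caveat with the query above: the window has no ORDER BY, so which duplicate row receives row_number 1 is not deterministic. If it matters which row wins, an ORDER BY can be added inside the window; as an illustration (the choice of ordering column is arbitrary), this keeps the row with the largest j per key:

```sql
-- Deterministic variant: for each key i, keep the duplicate with the
-- largest j, and skip keys that already exist in tbl.
INSERT INTO tbl
SELECT t.i, t.j
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY i ORDER BY j DESC) AS row_number
      FROM tbl2) t
ANTI JOIN tbl ON tbl.i = t.i
WHERE t.row_number = 1;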
SELECT * FROM tbl;
┌───────┬───────┐
│ i │ j │
│ int32 │ int32 │
├───────┼───────┤
│ 1 │ 42 │
│ 2 │ 7 │
└───────┴───────┘
Since INSERT OR IGNORE works as long as there are not multiple input rows violating the primary key constraint for the same key value (e.g. inserting both (1, 42) and (1, 53) when the table already contains (1, 23) fails), we can also deduplicate first and do the following:
INSERT OR IGNORE INTO tbl SELECT i, j FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY i) AS row_number FROM tbl2) t WHERE t.row_number = 1;
Not implementing INSERT OR IGNORE for several violating rows perhaps hinges on not wanting to decide which rows from a duplicate input set should be kept (in my opinion, that decision could be made with the method I outlined above). Alternatively, customized behaviour could be specified with hypothetical syntax such as MAX UNIXTIME, ANY, or MIN VALUE (where unixtime and value are columns), e.g.:
INSERT OR IGNORE INTO tbl SELECT * FROM tableDuplicates MAX UNIXTIME;
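For what it's worth, the "MAX UNIXTIME" behaviour sketched above can already be expressed today with DISTINCT ON plus ORDER BY (here tableDuplicates with columns i, j, unixtime is the same hypothetical table as above):

```sql
-- For each key i, keep the row with the largest unixtime,
-- then skip keys already present in tbl.
INSERT INTO tbl
SELECT d.i, d.j
FROM (SELECT DISTINCT ON (i) *
      FROM tableDuplicates
      ORDER BY unixtime DESC) d
ANTI JOIN tbl ON tbl.i = d.i;
```

Per the DISTINCT ON documentation, the row returned per key is the first one encountered under the ORDER BY, which here is the one with the largest unixtime.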
EDIT:
Thank you @Tishj for pointing out DISTINCT ON, which will solve the problem of trying to insert duplicate rows:
-- select only the highest population city for each country
SELECT DISTINCT ON(country) city, population
FROM cities
ORDER BY population DESC;
The DISTINCT ON clause returns only one row per unique value in the set of expressions as defined in the ON clause. If an ORDER BY clause is present, the row that is returned is the first row that is encountered as per the ORDER BY criteria. If an ORDER BY clause is not present, the first row that is encountered is not defined and can be any row in the table.