
Comments (4)

Tishj commented on May 24, 2024

Thanks for the detailed report; this is, however, currently an expected limitation of our upsert implementation.
Our upsert handles constraint violations between existing data and new data. Your constraint violations appear to be within the new data itself (i.e. you have duplicate rows in your new data).

I suggest using DISTINCT ON to filter out the duplicates in your new data.
https://duckdb.org/docs/sql/query_syntax/select#distinct-on-clause
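For example, deduplicating on the conflict key before the upsert might look like the following sketch (`tbl`, its key column `i`, and the source `new_data` are placeholders for the reporter's actual schema):

```sql
-- Keep one row per key i before upserting; with no ORDER BY,
-- which duplicate survives is arbitrary.
INSERT INTO tbl
SELECT DISTINCT ON (i) i, j
FROM new_data
ON CONFLICT (i) DO UPDATE SET j = EXCLUDED.j;
```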

from duckdb.

Tishj commented on May 24, 2024

> Should it be mentioned in the docs that the upserted data must contain no duplicate rows (per the target table's constraints), and that DISTINCT ON is the way to go?

That's a good idea; that would help combat this common pitfall until we fix this limitation 👍


bjornasm commented on May 24, 2024

Again, I will try to find the previous issue that touched on some of the same themes as this, but here is a workaround (for my case, at least). The workaround is maybe a bit convoluted, but I think it works for my use case, which is reading a large number of files into a table. I expect there to be duplicates in the files (but these are duplicates across all columns, not just the keys), and the process may need to be restarted without my being able to control which files have already been inserted, so my program might try to insert rows that are already in the table. The files are small enough to fit in memory. The following should accommodate that.

D CREATE TABLE tbl (i INT PRIMARY KEY, j INT);
D INSERT INTO tbl VALUES (1, 42);
D INSERT INTO tbl (SELECT DISTINCT * FROM (VALUES (1, 42), (1, 84), (3, 21)) AS tmp(a, b) ANTI JOIN tbl ON tbl.i = tmp.a);
D SELECT * FROM tbl;
┌───────┬───────┐
│   i   │   j   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │    42 │
│     3 │    21 │
└───────┴───────┘

The anti join ensures the rows have not already been inserted into the database, while the DISTINCT clause ensures there are no duplicate rows. Note that this workaround only works if duplicate rows are exactly the same across all columns, and the table does not have any NOT NULL constraints.

If the rows are not exactly the same (i.e. the column(s) that make up the key in the target table are equal, but other columns differ), you have to choose which rows will be inserted and which will be thrown away.

I assume the behavior of INSERT OR IGNORE would be to insert the first of the rows with duplicate key columns, while those that follow are ignored/not inserted because they would violate the constraints.

I think the following code mimics that behavior:

--Table that is to be inserted into
CREATE TABLE tbl (i INT PRIMARY KEY, j INT);
INSERT INTO tbl VALUES (1, 42);

--Table (or file) that is inserted from
CREATE TABLE tbl2 (i INT, j INT);
INSERT INTO tbl2 VALUES (1,8),(2,7),(2,9),(2,10),(1,3);
SELECT * FROM tbl2;
┌───────┬───────┐
│   i   │   j   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     8 │
│     2 │     7 │
│     2 │     9 │
│     2 │    10 │
│     1 │     3 │
└───────┴───────┘

First, let's look at how rows from the insert data with the same values in the primary key column(s), here column i, are handled (note that without an ORDER BY inside the window, which row gets row_number = 1 is not deterministic):

SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY i) AS row_number FROM tbl2) t WHERE t.row_number = 1;
┌───────┬───────┬────────────┐
│   i   │   j   │ row_number │
│ int32 │ int32 │   int64    │
├───────┼───────┼────────────┤
│     2 │     7 │          1 │
│     1 │     8 │          1 │
└───────┴───────┴────────────┘

Combine the method for ensuring that the set has no duplicate values in the selected columns with the method of not inserting rows whose primary key values already exist in the table:

INSERT INTO tbl SELECT t.i, t.j FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY i) AS row_number FROM tbl2) t ANTI JOIN tbl ON tbl.i = t.i WHERE t.row_number = 1;
SELECT * FROM tbl;
┌───────┬───────┐
│   i   │   j   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │    42 │
│     2 │     7 │
└───────┴───────┘

Since INSERT OR IGNORE only works as long as there are not several rows violating the primary key constraint for the same key (e.g. inserting both (1, 42) and (1, 53) into a table that already contains (1, 23)), we can also do the following:

INSERT OR IGNORE INTO tbl SELECT t.i, t.j FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY i) AS row_number FROM tbl2) t WHERE t.row_number = 1;
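SQLite's INSERT OR IGNORE already behaves this way, so a quick stand-in sketch using Python's built-in sqlite3 module (not DuckDB, whose semantics for this case are exactly what is under discussion) can illustrate the assumed first-row-wins behavior:

```python
import sqlite3

# In-memory SQLite database as a stand-in; its INSERT OR IGNORE keeps the
# first row per conflicting key and silently drops the rest, which is the
# behavior assumed above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (i INTEGER PRIMARY KEY, j INTEGER)")
conn.execute("INSERT INTO tbl VALUES (1, 42)")

# (1, 8) conflicts with the existing row; (2, 9), (2, 10), and (1, 3)
# conflict with earlier rows of the same statement -- all are ignored,
# only (2, 7) is actually inserted.
conn.execute(
    "INSERT OR IGNORE INTO tbl VALUES (1, 8), (2, 7), (2, 9), (2, 10), (1, 3)"
)
rows = conn.execute("SELECT i, j FROM tbl ORDER BY i").fetchall()
print(rows)  # -> [(1, 42), (2, 7)]
```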

Not implementing INSERT OR IGNORE for several violating rows maybe hinges on not wanting to make the choice of which rows from a duplicate input set should be chosen (in my opinion the method I outlined above would be a reasonable default). Alternatively, one could specify e.g. MAX unixtime, ANY, MIN value, etc. (where unixtime and value are columns) for customized behaviour:

INSERT OR IGNORE INTO tbl SELECT * FROM tableDuplicates MAX UNIXTIME; -- hypothetical syntax

EDIT:

Thank you @Tishj for pointing out DISTINCT ON, which will solve the problem of trying to insert duplicate rows:

DISTINCT ON Clause

-- select only the highest population city for each country
SELECT DISTINCT ON(country) city, population
FROM cities
ORDER BY population DESC;

The DISTINCT ON clause returns only one row per unique value in the set of expressions as defined in the ON clause. If an ORDER BY clause is present, the row that is returned is the first row that is encountered as per the ORDER BY criteria. If an ORDER BY clause is not present, the first row that is encountered is not defined and can be any row in the table.
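Applied to the tbl2 example above, the whole workaround could be collapsed into a single statement (a sketch; here `ORDER BY i, j` makes the surviving duplicate deterministic, picking the smallest j per key):

```sql
-- One row per key i from tbl2, then insert only keys not already in tbl.
INSERT INTO tbl
SELECT t.i, t.j
FROM (SELECT DISTINCT ON (i) i, j FROM tbl2 ORDER BY i, j) t
ANTI JOIN tbl ON tbl.i = t.i;
```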


bjornasm commented on May 24, 2024

Thank you @Tishj - wow that solves parts of this problem perfectly, I was not aware of DISTINCT ON.


Should it be mentioned in the docs that the upserted data must contain no duplicate rows (per the target table's constraints), and that DISTINCT ON is the way to go?

