
Comments (4)

Tishj commented on May 24, 2024

Thanks for the detailed report; this is, however, currently an expected limitation of our upsert implementation.
Our upsert handles constraint violations between existing data and new data. Your constraint violations appear to be within the new data itself (i.e. you have duplicate rows in your new data).

I suggest using DISTINCT ON to filter out the duplicates in your new data.
https://duckdb.org/docs/sql/query_syntax/select#distinct-on-clause
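For example, deduplicating on the conflict key before the upsert might look like the following sketch (`tbl`, its key column `i`, and the source `new_data` are placeholders for the reporter's actual schema):

```sql
-- Keep one row per key i before upserting; with no ORDER BY,
-- which duplicate survives is arbitrary.
INSERT INTO tbl
SELECT DISTINCT ON (i) i, j
FROM new_data
ON CONFLICT (i) DO UPDATE SET j = EXCLUDED.j;
```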

from duckdb.

Tishj commented on May 24, 2024

> Should it be mentioned in the docs that the upserted data must contain no duplicate rows (per the target table's constraints), and that DISTINCT ON is the way to go?

That's a good idea; that would help combat this common pitfall until we fix this limitation 👍


bjornasm commented on May 24, 2024

Again, I will try to find the previous issue that touched on some of the same themes as this, but here is a workaround (for my case, at least). The workaround is maybe a bit convoluted, but I think it works for my use case, which is reading a large number of files into a table. I expect there to be duplicates in the files (but these are duplicates across all columns, not just the keys), and the process may need to be restarted without my being able to control which files have already been inserted, so my program might try to insert rows that are already in the table. The files are small enough to fit in memory. The following should accommodate that.

D CREATE TABLE tbl (i INT PRIMARY KEY, j INT);
D INSERT INTO tbl VALUES (1, 42);
D INSERT INTO tbl (SELECT DISTINCT * FROM (VALUES (1, 42), (1, 84), (3, 21)) AS tmp(a, b) ANTI JOIN tbl ON tbl.i = tmp.a);
D SELECT * FROM tbl;
┌───────┬───────┐
│   i   │   j   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │    42 │
│     3 │    21 │
└───────┴───────┘

The anti join ensures the rows have not already been inserted into the database, while the DISTINCT clause ensures there are no duplicate rows. Note that this workaround only works if duplicate rows are exactly the same across all columns, and the table does not have any NOT NULL constraints.

If the rows are not exactly the same (i.e. the column(s) that make up the key in the target table are equal, but other columns differ), you have to choose which rows will be inserted and which will be thrown away.

I assume the behavior of INSERT OR IGNORE would be to insert the first of the rows with duplicate key columns, while those that follow are ignored/not inserted because they would violate the constraints.

I think the following code mimics that behavior:

--Table that is to be inserted into
CREATE TABLE tbl (i INT PRIMARY KEY, j INT);
INSERT INTO tbl VALUES (1, 42);

--Table (or file) that is inserted from
CREATE TABLE tbl2 (i INT, j INT);
INSERT INTO tbl2 VALUES (1,8),(2,7),(2,9),(2,10),(1,3);
SELECT * FROM tbl2;
┌───────┬───────┐
│   i   │   j   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     8 │
│     2 │     7 │
│     2 │     9 │
│     2 │    10 │
│     1 │     3 │
└───────┴───────┘

First, let's look at how rows from the insert data with the same values in the primary key column(s), here column i, are handled (note that without an ORDER BY inside the window, which row gets row_number = 1 is not deterministic):

SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY i) AS row_number FROM tbl2) t WHERE t.row_number = 1;
┌───────┬───────┬────────────┐
│   i   │   j   │ row_number │
│ int32 │ int32 │   int64    │
├───────┼───────┼────────────┤
│     2 │     7 │          1 │
│     1 │     8 │          1 │
└───────┴───────┴────────────┘

Combine the method for ensuring that the set has no duplicate values in the selected columns with the method of not inserting rows whose primary key values already exist in the table:

INSERT INTO tbl SELECT t.i, t.j FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY i) AS row_number FROM tbl2) t ANTI JOIN tbl ON tbl.i = t.i WHERE t.row_number = 1;
SELECT * FROM tbl;
┌───────┬───────┐
│   i   │   j   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │    42 │
│     2 │     7 │
└───────┴───────┘

Since INSERT OR IGNORE only works as long as there are not several rows violating the primary key constraint for the same key (e.g. inserting both (1, 42) and (1, 53) into a table that already contains (1, 23)), we can also do the following:

INSERT OR IGNORE INTO tbl SELECT t.i, t.j FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY i) AS row_number FROM tbl2) t WHERE t.row_number = 1;
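SQLite's INSERT OR IGNORE already behaves this way, so a quick stand-in sketch using Python's built-in sqlite3 module (not DuckDB, whose semantics for this case are exactly what is under discussion) can illustrate the assumed first-row-wins behavior:

```python
import sqlite3

# In-memory SQLite database as a stand-in; its INSERT OR IGNORE keeps the
# first row per conflicting key and silently drops the rest, which is the
# behavior assumed above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (i INTEGER PRIMARY KEY, j INTEGER)")
conn.execute("INSERT INTO tbl VALUES (1, 42)")

# (1, 8) conflicts with the existing row; (2, 9), (2, 10), and (1, 3)
# conflict with earlier rows of the same statement -- all are ignored,
# only (2, 7) is actually inserted.
conn.execute(
    "INSERT OR IGNORE INTO tbl VALUES (1, 8), (2, 7), (2, 9), (2, 10), (1, 3)"
)
rows = conn.execute("SELECT i, j FROM tbl ORDER BY i").fetchall()
print(rows)  # -> [(1, 42), (2, 7)]
```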

Not implementing INSERT OR IGNORE for several violating rows maybe hinges on not wanting to make the choice of which rows from a duplicate input set should be chosen (in my opinion the method I outlined above would be a reasonable default). Alternatively, one could specify e.g. MAX unixtime, ANY, MIN value, etc. (where unixtime and value are columns) for customized behaviour:

INSERT OR IGNORE INTO tbl SELECT * FROM tableDuplicates MAX UNIXTIME; -- hypothetical syntax

EDIT:

Thank you @Tishj for pointing out DISTINCT ON, which will solve the problem of trying to insert duplicate rows:

DISTINCT ON Clause

-- select only the highest population city for each country
SELECT DISTINCT ON(country) city, population
FROM cities
ORDER BY population DESC;

The DISTINCT ON clause returns only one row per unique value in the set of expressions as defined in the ON clause. If an ORDER BY clause is present, the row that is returned is the first row that is encountered as per the ORDER BY criteria. If an ORDER BY clause is not present, the first row that is encountered is not defined and can be any row in the table.
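Applied to the tbl2 example above, the whole workaround could be collapsed into a single statement (a sketch; here `ORDER BY i, j` makes the surviving duplicate deterministic, picking the smallest j per key):

```sql
-- One row per key i from tbl2, then insert only keys not already in tbl.
INSERT INTO tbl
SELECT t.i, t.j
FROM (SELECT DISTINCT ON (i) i, j FROM tbl2 ORDER BY i, j) t
ANTI JOIN tbl ON tbl.i = t.i;
```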


bjornasm commented on May 24, 2024

Thank you @Tishj - wow that solves parts of this problem perfectly, I was not aware of DISTINCT ON.


Should it be mentioned in the docs that the upserted data must contain no duplicate rows (per the target table's constraints), and that DISTINCT ON is the way to go?

