Coder Social home page Coder Social logo

Comments (5)

TSchiefer avatar TSchiefer commented on July 18, 2024

The above was created with dbplyr v2.3.0, with v2.2.0 we get the expected behavior, i.e. a 2-row tibble without the need for compute():

suppressPackageStartupMessages({
  library(dplyr)
  library(tibble)
  library(duckdb)
})
packageVersion("dbplyr")
#> [1] '2.2.0'
lcl_tbl <- tibble(
  a = as.integer(c(1, 2, 1)),
  e = c("c", "b", "c"),
  f = c(TRUE, FALSE, TRUE)
)

src_duckdb <- dbplyr::src_dbi(DBI::dbConnect(duckdb::duckdb()), auto_disconnect = TRUE)

rmt_tbl <- copy_to(src_duckdb, lcl_tbl)
rmt_tbl %>% 
  distinct() %>%
  mutate(aef = row_number(a))
#> # Source:   SQL [2 x 4]
#> # Database: DuckDB 0.5.0 [root@Darwin 22.2.0:R 4.2.2/:memory:]
#>       a e     f       aef
#>   <int> <chr> <lgl> <dbl>
#> 1     1 c     TRUE      1
#> 2     2 b     FALSE     2

Created on 2023-02-14 with reprex v2.0.2

This explains why we see the unexpected behavior on all remote sources. The duckdb version has apparently nothing to do with our problem.

from dm.

TSchiefer avatar TSchiefer commented on July 18, 2024

Problem seems to lie in the query created by dbplyr v2.3.0:

suppressPackageStartupMessages({
  library(dplyr)
  library(tibble)
  library(duckdb)
})
packageVersion("dbplyr")
#> [1] '2.3.0'
lcl_tbl <- tibble(
  a = as.integer(c(1, 2, 1)),
  e = c("c", "b", "c"),
  f = c(TRUE, FALSE, TRUE)
)

src_duckdb <- dbplyr::src_dbi(DBI::dbConnect(duckdb::duckdb()), auto_disconnect = TRUE)

rmt_tbl <- copy_to(src_duckdb, name = "rmt_tbl", lcl_tbl)

rmt_tbl %>% 
  distinct() %>% 
  show_query()
#> <SQL>
#> SELECT DISTINCT *
#> FROM rmt_tbl

rmt_tbl %>% 
  distinct() %>%
  mutate(aef = row_number(a)) %>% 
  show_query()
#> <SQL>
#> SELECT DISTINCT
#>   *,
#>   CASE
#> WHEN (NOT((a IS NULL))) THEN ROW_NUMBER() OVER (PARTITION BY (CASE WHEN ((a IS NULL)) THEN 1 ELSE 0 END) ORDER BY a)
#> END AS aef
#> FROM rmt_tbl

Created on 2023-02-15 with reprex v2.0.2

whereas for dbplyr c2.2.0 it was:

# [...] (same as above)
rmt_tbl %>% 
  distinct() %>%
  mutate(aef = row_number(a)) %>% 
  show_query()
#> <SQL>
#> SELECT *, ROW_NUMBER() OVER (ORDER BY a) AS aef
#> FROM (
#>   SELECT DISTINCT *
#>   FROM rmt_tbl
#> ) q01

Created on 2023-02-15 with reprex v2.0.2

It looks like the order of distinct() and mutate() is switched by the translation to SQL code for dbplyr v2.3.0.

from dm.

TSchiefer avatar TSchiefer commented on July 18, 2024

As to the problem with flattening:

suppressPackageStartupMessages({
  library(dplyr)
  library(tibble)
  library(duckdb)
})
packageVersion("dbplyr")
#> [1] '2.3.0'
lcl_tbl_1 <- tibble(
  a = 1:2,
  e = letters[1:2],
)

lcl_tbl_2 <- tibble(
  a = 1:2,
  e = letters[3:4],
)
lcl_tbl_3 <- tibble(
  a = 1:2,
  e = letters[5:6],
)
lcl_tbl_4 <- tibble(
  a = 1:2,
  e = letters[7:8],
)
lcl_tbl_1 %>% 
  left_join(rename(lcl_tbl_2, e_2 = e), by = "a") %>% 
  left_join(rename(lcl_tbl_3, e_3 = e), by = "a") %>% 
  left_join(rename(lcl_tbl_4, e_4 = e), by = "a")
#> # A tibble: 2 × 5
#>       a e     e_2   e_3   e_4  
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 a     c     e     g    
#> 2     2 b     d     f     h


src_duckdb <- dbplyr::src_dbi(DBI::dbConnect(duckdb::duckdb()), auto_disconnect = TRUE)

rmt_tbl_1 <- copy_to(src_duckdb, name = "rmt_tbl_1", lcl_tbl_1)
rmt_tbl_2 <- copy_to(src_duckdb, name = "rmt_tbl_2", lcl_tbl_2)
rmt_tbl_3 <- copy_to(src_duckdb, name = "rmt_tbl_3", lcl_tbl_3)
rmt_tbl_4 <- copy_to(src_duckdb, name = "rmt_tbl_4", lcl_tbl_4)
rmt_tbl_1 %>% 
  left_join(rename(rmt_tbl_2, e_2 = e), by = "a") %>% 
  left_join(rename(rmt_tbl_3, e_3 = e), by = "a") %>% 
  left_join(rename(rmt_tbl_4, e_4 = e), by = "a")
#> # Source:   SQL [2 x 5]
#> # Database: DuckDB 0.6.2-dev1166 [root@Darwin 22.2.0:R 4.2.2/:memory:]
#>       a e     e_2   e_3   e_4  
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 a     c     e     e    
#> 2     2 b     d     f     f
rmt_tbl_1 %>% 
  left_join(rename(rmt_tbl_2, e_2 = e), by = "a") %>% 
  left_join(rename(rmt_tbl_3, e_3 = e), by = "a") %>% 
  left_join(rename(rmt_tbl_4, e_4 = e), by = "a") %>% 
  dbplyr::sql_render()
#> <SQL> SELECT rmt_tbl_1.*, rmt_tbl_2.e AS e_2, rmt_tbl_3.e AS e_3, rmt_tbl_3.e AS e_4
#> FROM rmt_tbl_1
#> LEFT JOIN rmt_tbl_2
#>   ON (rmt_tbl_1.a = rmt_tbl_2.a)
#> LEFT JOIN rmt_tbl_3
#>   ON (rmt_tbl_1.a = rmt_tbl_3.a)
#> LEFT JOIN rmt_tbl_4
#>   ON (rmt_tbl_1.a = rmt_tbl_4.a)

Created on 2023-02-15 with reprex v2.0.2
In the SELECT part of the query rmt_tbl_3.e is wrongly selected twice, once as e_3 (correct) and once as e_4 (wrong).
This behavior leads to the differences in the flatten-snapshot.

from dm.

TSchiefer avatar TSchiefer commented on July 18, 2024

see tidyverse/dbplyr#1158 and tidyverse/dbplyr#1159

from dm.

TSchiefer avatar TSchiefer commented on July 18, 2024

Both of these problems will be fixed on GHA after the next patch-release of dbplyr.

from dm.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.