
cynkra / dm

494 stars · 10 watchers · 50 forks · 53.6 MB

Working with relational data models in R

Home Page: https://dm.cynkra.com

License: Other

Languages: R 97.96%, Makefile 0.29%, CSS 0.52%, JavaScript 0.96%, HTML 0.20%, Dockerfile 0.08%
Topics: r, relational-databases, data-model, dbplyr, dbi, datawarehousing, data-warehousing


dm's Issues

Implement cdm_select()

If the user tries to remove a key column (pk/fk), a message should be given, similar to:

library(tidyverse)
iris %>%
  group_by(Species) %>%
  select(-Species)
#> Adding missing grouping variables: `Species`
#> # A tibble: 150 x 5
#> # Groups:   Species [3]
#>    Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#>    <fct>          <dbl>       <dbl>        <dbl>       <dbl>
#>  1 setosa           5.1         3.5          1.4         0.2
#>  2 setosa           4.9         3            1.4         0.2
#>  3 setosa           4.7         3.2          1.3         0.2
#>  4 setosa           4.6         3.1          1.5         0.2
#>  5 setosa           5           3.6          1.4         0.2
#>  6 setosa           5.4         3.9          1.7         0.4
#>  7 setosa           4.6         3.4          1.4         0.3
#>  8 setosa           5           3.4          1.5         0.2
#>  9 setosa           4.4         2.9          1.4         0.2
#> 10 setosa           4.9         3.1          1.5         0.1
#> # … with 140 more rows

Created on 2019-09-06 by the reprex package (v0.3.0)

Also consider a prune option for cdm_select() (later!) that removes all related tables if a key is removed (with #44).

activate()-style API

With #57 and part of #54 we're now ready to start looking into an API that extracts a table from a dm but keeps the context, so that it can be put back.

dm %>%
  activate(flights) %>%
  filter(hour < 6) %>%
  ...

How do we put back a table?

  • Overwrite an existing table, removing all obsolete keys
  • Create a new table and add primary and foreign keys where we know that they haven't been altered

How are the verbs called?

We temporarily remove a piece of a mosaic and reinsert it, or perhaps insert a replica. Is reinsertion different from adding a new version? Maybe zoom_to_tbl(), update_tbl(), and insert_tbl()?
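A rough sketch of how these verbs could chain, using the candidate names above (nothing here is implemented yet; names and semantics are still open):

library(dm)
library(dplyr)

cdm_nycflights13() %>%
  zoom_to_tbl(flights) %>%   # extract `flights` but keep the dm context
  filter(hour < 6) %>%       # ordinary dplyr verbs on the zoomed table
  update_tbl()               # put the table back, keeping keys we know are unchanged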

Roadmap

  • zoom_to_tbl() and update_tbl(), with print() and format() method
  • select.dm() and rename.dm(), with a nice error message if not activated
  • filter.dm()
  • mutate.dm() and transmute.dm()
  • summarise.dm()
  • group_by.dm(), ungroup.dm()

Independently:

  • cdm_add_tbl() (useful for #77)
  • insert_tbl()

cdm_draw is duplicating a column name

Reproducible code (R and all CRAN packages up to date, dm up to date)

dm0 <- structure(list(src = structure(list(tbl_f = function (data) 
{
    as_tibble(data, .name_repair = "check_unique")
}, name = "<environment: 0x000001b8d0e65a70>", env = <environment>), class = c("src_local", 
"src")), tables = list(exceptions = structure(list(site = character(0), 
    onprem_id = numeric(0), exception_message = character(0)), row.names = integer(0), class = c("tbl_df", 
"tbl", "data.frame")), sam1_flight = structure(list(onprem_id = numeric(0), 
    station = character(0), received_at = structure(numeric(0), tzone = "", class = c("POSIXct", 
    "POSIXt")), data = character(0)), row.names = integer(0), class = c("tbl_df", 
"tbl", "data.frame")), sam26_event = structure(list(onprem_id = numeric(0), 
    station = character(0), received_at = structure(numeric(0), tzone = "", class = c("POSIXct", 
    "POSIXt")), data = character(0)), row.names = integer(0), class = c("tbl_df", 
"tbl", "data.frame"))), data_model_tables = structure(list(table = c("exceptions", 
"sam1_flight", "sam26_event"), segment = c(NA, NA, NA), display = c("accent4", 
"accent2", "accent2")), row.names = c(3L, 12L, 13L), class = "data.frame"), 
    data_model_pks = structure(list(column = c("onprem_id", "onprem_id"
    ), table = c("sam1_flight", "sam26_event")), row.names = c(NA, 
    -2L), class = c("tbl_df", "tbl", "data.frame")), data_model_fks = structure(list(
        table = c("exceptions", "exceptions"), column = c("onprem_id", 
        "onprem_id"), ref = c("sam26_event", "sam1_flight"), 
        ref_col = c("onprem_id", "onprem_id")), row.names = c(NA, 
    -2L), class = c("tbl_df", "tbl", "data.frame")), filter = NULL), class = "dm")
  
###############

cdm_draw(dm0[c("exceptions","sam1_flight")],view_type = "all")
cdm_draw(dm0,view_type = "all")

The first cdm_draw() call prints what I expect:

[screenshot: expected diagram]

With the second call, the column onprem_id is duplicated:

[screenshot: diagram with onprem_id shown twice]

decompose_table() with user-specified keys

decompose_table() should work when the user provides a custom key that is unique. Currently:

library(dplyr)
library(dm)

nycflights13::planes %>% 
  add_count(model, engines, name = "n_eng") %>% 
  add_count(model, seats, name = "n_seats") %>% 
  add_count(model, manufacturer, name = "n_manufacturer") %>% 
  mutate(model_id = paste0(
    model,
    if_else(n_eng > 1, paste0("-", engines), ""),
    if_else(n_seats > 1, paste0("-", seats), ""),
    if_else(n_manufacturer > 1, paste0("-", abbreviate(manufacturer)), "")
  )) %>% 
  decompose_table(model_id, model, manufacturer, type, engines, seats, manufacturer)
#> `new_id_column` can not have an identical name as one of the columns of
#> `.`.

Created on 2019-09-24 by the reprex package (v0.3.0)

A more helpful error message when using numeric indices on dm objects?

Thanks for the fantastic package.

I think the following error could be improved:

library(dm)
library(dplyr,warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 3.6.1
dm0 <- as_dm(lst(iris,cars))
dm0[[1]]
#> Can't convert a double vector to a string

Created on 2019-07-31 by the reprex package (v0.3.0)

dm:::`[[.dm`
#> function (x, name) 
#> {
#>     table <- as_string(name)
#>     tbl(x, table)
#> }

My understanding is that dm objects are lists rather than environments because you want to pass them by copy, but at the same time you'd rather not allow numeric indices because order doesn't make sense in a DB.

In that case I would suggest something like: "Can't subset a dm object by position; either subset by name or use cdm_get_tables() to convert to a regular list first."


Note that the choice not to allow numeric indices makes the operation map(dm0, nrow) fail, which might be unexpected for a user who previously had success running nrow(dm0[["iris"]]) and nrow(dm0[["cars"]]).
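A workaround exists today via cdm_get_tables(); a quick sketch, assuming purrr for the iteration:

library(dm)
library(dplyr)
library(purrr)

dm0 <- as_dm(lst(iris, cars))

# Convert to a plain named list first, then iterate over the tables.
dm0 %>%
  cdm_get_tables() %>%
  map_int(nrow)
# -> named integer vector: iris 150, cars 50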

Feature request: accept NULL arguments in `cdm_add_fk()` to guess.

In my use case I would like to define only my primary keys manually, then run a command on my data model and have all obvious foreign keys set up appropriately.

I believe many models are fairly regular, so having a feature to guess the keys can be pretty awesome; for me it would spare a fair number of lines of code and would mean I don't have to update my code much when adding a table.

I believe this could be achieved by allowing NULL arguments in cdm_add_fk().

Standard use:

cdm_add_fk(my_dm, my_table, my_column, my_ref_table, check = TRUE) 

Look for my_column in all tables from my_dm, and set them as foreign keys linked to my_ref_table's primary key:

cdm_add_fk(my_dm, NULL, my_column, my_ref_table, check = TRUE) 

Look for any column in my_table named like the primary keys of my_ref_table, and make them foreign keys:

cdm_add_fk(my_dm, my_table, NULL, my_ref_table, check = TRUE) 

Look for any column in any table of my_dm named like the primary keys of my_ref_table, and make them foreign keys:

cdm_add_fk(my_dm, NULL, NULL, my_ref_table, check = TRUE)

Look for a primary key on my_column in any table of my_dm and build the appropriate foreign key.

cdm_add_fk(my_dm, my_table, my_column, NULL, check = TRUE) 

Look for a primary key on my_column in any table of my_dm and build the appropriate foreign keys on all columns of other tables containing this column.

cdm_add_fk(my_dm, NULL, my_column, NULL, check = TRUE) 

Look for all primary keys, see if any of them matches the columns of my_table by name, and build the appropriate foreign keys.

cdm_add_fk(my_dm, my_table, NULL, NULL, check = TRUE) 

Look for all primary keys, see if any of them matches the columns of all other tables by name, and build the appropriate foreign keys.

cdm_add_fk(my_dm, NULL, NULL, NULL, check = TRUE) 

check = TRUE could either trigger a failure in case of incompatible value subsets, or issue an explicit warning and not define the ineligible foreign keys. (check = NA might be used to choose between these.)

Improve internal data structure again

To make it easy to maintain internal consistency, I think the internal data structure needs to be a tibble with the following columns (see the sketch at the end of this issue):

  • Table name (character)
  • Table object (data frame/tbl/...) (list)
  • Color (character)
  • Filter expressions (list)
  • Primary keys (list of character)
  • Child tables (inverse of foreign keys; list of tibbles of the form "table name + list of character")

With this, it seems that we only need two "restore" functions:

  • Fix table names after a table select/rename
  • Fix column names after a column select/rename

These "restore" functions accept a tidyselect-like recipe that describes how to handle renames. These can be passed on to dplyr::recode() to implement the actual recoding.
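For concreteness, a minimal sketch of what such a definition tibble could look like for a two-table dm; the column names here are illustrative, not final:

library(tibble)

def <- tibble(
  table = c("airlines", "flights"),
  data = list(nycflights13::airlines, nycflights13::flights),
  display = c("blue", "orange"),                     # color
  filters = list(list(), list(quote(month == 1))),   # filter expressions
  pks = list("carrier", character()),                # primary keys
  fks = list(                                        # child tables (inverse FKs)
    tibble(child_table = "flights", child_cols = list("carrier")),
    tibble(child_table = character(), child_cols = list())
  )
)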

`dm_repair_constraints()` (OP: Supporting relationships with weaker constraints)

Is it possible, perhaps outside of the dm ecosystem, to model relationships between tables that aren't based on FK-PK relationships? Think of a situation where there is a flights and an airports table, where some airport codes may not exist in the flights table (or the other way around). These would definitely be messy data relationships, but they do occur fairly frequently in the data I work with. My current approach is to write lots of dplyr join statements (and try to remember the relationships) or to build views in a lightweight system like SQLite and query the views. It would be nice to be able to codify these relationships directly within R.

Thanks for the time and the package!

Stricter logic for `cdm_flatten_to_tbl()`

The function itself should not think too much; it should apply the same logic for all join methods. Furthermore, only tables that are (direct) parents will be allowed for this function (no grandparents etc.). If the user passes tables in the ellipsis, the joins will happen in this order (a usage sketch follows at the end of this issue).

  1. left_join(), inner_join(), full_join(): if no tables are given in the ellipsis, the result will depend on the order of the tables in the (relevant part of the) dm, but this only affects the column order.
  2. semi_join(): the result will be exactly those rows that are referenced by ALL parent tables.
  3. anti_join(): the result will be those rows that are NOT referenced by ANY parent table (i.e. referenced by NO parent table). It might be more interesting for the user to have a function telling them which rows are not referenced by at least one parent table, but that needs to be another function.
  4. nest_join(): the created columns should have the names of the RHS tables of the join. Again, if the ellipsis is empty, the order of those tables in the dm determines the resulting column order.
  5. right_join(): if the ellipsis is empty, the number of rows might depend on the table that comes last in the dm table list (if referential integrity is not given). In this case, a warning is issued.

There will be a further function cdm_flatten_deep_to_tbl(), which works only for left_join(), inner_join() and full_join(), and which also allows for deeper table hierarchies (grandparents etc.).
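For illustration, a hypothetical call under these rules (the join argument is assumed to work as in cdm_join_to_tbl()):

library(dm)
library(dplyr)

# `airlines` and `planes` are direct parents of `flights`; listing them in the
# ellipsis fixes the join order, and the chosen join verb applies uniformly.
cdm_nycflights13(cycle = FALSE) %>%
  cdm_flatten_to_tbl(flights, airlines, planes, join = left_join)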

Clarify error message when joining tables with cycles

While working through the README and vignettes to learn this package (it's fantastic by the way, thank you!) I ran into this error:

dm_nycflights <- cdm_nycflights13(cycle = TRUE)

# ... testing and exploration ...

dm_nycflights %>%
  cdm_join_to_tbl(airports, flights, join = semi_join)
## Error: `table` must be a string.
## Call `rlang::last_error()` to see a backtrace

It took me a while to figure out that the issue was cycle = TRUE and not the format of the table arguments. Finally, I realized the lines I was supposed to run needed cycle = FALSE.

Maybe cdm_join_to_tbl() could provide a better hint in this case?

As a related suggestion, the README mentions that cdm_join_to_tbl() requires cycle-free relationships, but it might be helpful to explain why this is the case.

Support for compound keys

Are there any plans/thoughts on supporting compound keys? I frequently have a table where the primary key is a combination of fields. It looks like putting a PK constraint on something like that in dm requires creating a new single column that is a mutation of the two non-unique components.
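For reference, a minimal sketch of the current workaround mentioned above (collapsing the component columns into a single surrogate key before declaring the primary key); the example data is made up:

library(dplyr)
library(tibble)
library(dm)

# Toy table whose natural key is the combination (site, year).
readings <- tibble(
  site  = c("a", "a", "b"),
  year  = c(2018, 2019, 2018),
  value = c(1.2, 3.4, 5.6)
) %>%
  mutate(site_year = paste(site, year, sep = "_"))  # surrogate single-column key

as_dm(list(readings = readings)) %>%
  cdm_add_pk(readings, site_year)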

Thanks for the package!

Avoid using cdm_get_data_model() where possible

We now have a much better internal structure available through cdm_get_def() etc. Remove things related to new_dm() and new_dm2(); prefer new_dm3() over new_dm2() or even new_dm().

  • selection of tables
  • selection of columns (with cdm_restore())
  • primary keys
  • foreign keys
  • filtering
  • collect and copy_to
  • disambiguate
  • print
  • tests and helpers
  • coverage test to remove dead code
  • learn

Schema support

Required for copying to and learning from a database.

Options:

  1. Rely on "default schema" (e.g. Oracle) or "search path" (e.g. in Postgres) -- many but not all DBMS know this concept
  2. Pass in an optional schema qualifier in copy/learn operations

The first option is simpler, but it looks like we need to take care of the schema identifier ourselves.

Let's add a schema argument which takes a DBI::Id object.
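A hypothetical interface sketch, assuming cdm_copy_to() gains such a schema argument (the argument name and behaviour are not final):

library(DBI)
library(dm)

con <- dbConnect(RPostgres::Postgres(), dbname = "analytics")

cdm_copy_to(
  con,
  cdm_nycflights13(),
  schema = Id(schema = "staging"),  # hypothetical argument taking a DBI::Id
  temporary = FALSE
)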

Ideally, cdm_learn_from_db() would be part of the dm() constructor -- later.

Handling of filter conditions for `cdm_get_tables()`

cdm_get_tables() currently ignores filter conditions. When using cdm_get_tables() while filter conditions are set, we should do one of:

  1. give a warning message
  2. throw an error
  3. return the filtered tables

In the third case (which I think is the approach we should probably choose), we would need a new function that is implemented like cdm_get_tables() currently is, and use it wherever possible, because in many places we only need, for example, the column names and don't care whether filters have been applied.

Mutating from different tables

Hi all,

I think that {dm} offers great prospects for those handling complex datasets. In my specific case, I would love to be able to use {dm} to handle a dozen locally stored interrelated tables, instead of having to perform endless joins and duplicating the information multiple times. It is not so much to save memory as to make manipulation more natural and safer.

Yet, not coming from the relational database side of things, but having instead always used data frames in R, I see one critical obstacle preventing me from using your otherwise great package: the impossibility of performing mutation (sensu dplyr::mutate, dplyr::transmute), even for locally stored data (if I missed it, very sorry).

I would like to share some thoughts about that in case you would like to consider the idea.

  • IMO the best from a user perspective would probably be to have a dplyr::mutate-like function. The issue is that ideally a mutation could depend on variables located in different tables (with the usual restriction of source vectors being the same length or of length one). In this context, it is not clear how to create a slick syntax for that, but perhaps a dedicated operator could help.

Here is a made up example (probably stupid from an aviation perspective) of what it could look like:

fuel_per_mile_per_engine <- 2.5 # in gallons
cdm_mutate(
   data = dm::cdm_nycflights13(),
   table = flights,
   fuel = distance %from% flights * engines %from% planes * fuel_per_mile_per_engine
)

table would refer to where to insert the new column, fuel = xxx would feed the ... argument, and the %from% (or %$%?) operator would quote both the lhs and rhs to evaluate them in the right context. If the column names are unique across tables, we could even imagine that the %from% tbl part could be guessed by cdm_mutate().

  • a perhaps easier (but not necessarily alternative) solution would be to implement a cdm_update_tbl(), which would be similar to cdm_add_tbl() but would allow updating an existing table. The idea is that one would use tbl() to extract the required tables, do the mutation steps (and joins if required) using {dplyr} directly, and then put the table back into the dm object using cdm_update_tbl(). Contrary to cdm_rm_tbl() + cdm_add_tbl(), cdm_update_tbl() would have the benefit of not losing the definitions of the primary and foreign keys in the process.

I know this represents some work and that you must set priorities, but these would be very welcome feature additions.

Feel free to ignore that and close the issue if you think I am missing the point.

++

Alex

corner case handling of 'adapt_fk_cols()'

Need to consider cases where:

  1. multiple tables have pk-columns of the same name and different fk-columns pointing to them from the same table.
  2. two foreign key columns from one table point to one and the same pk-column of another table

Implement cdm_rename()

Required for #40. Same syntax as cdm_filter().
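A hypothetical usage sketch, mirroring cdm_filter()'s table-first syntax (the interface is not final):

library(dm)
library(dplyr)

cdm_nycflights13() %>%
  cdm_rename(airports, airport_name = name)  # rename `name` inside `airports`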

cdm_select() is similar but has the additional complexity that removed columns need to be taken care of. Later.

Default settings for `cdm_copy_to()`

The default parameter settings for cdm_copy_to() should produce temporary tables with unique table names.

The idea is to have the default unique_table_names = NULL, which becomes TRUE when temporary = TRUE and FALSE otherwise.
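A minimal sketch of that default resolution (the helper is hypothetical; the argument names follow this issue):

resolve_unique_table_names <- function(unique_table_names = NULL, temporary = TRUE) {
  if (is.null(unique_table_names)) {
    unique_table_names <- temporary
  }
  unique_table_names
}

resolve_unique_table_names()                          # TRUE (temporary default)
resolve_unique_table_names(temporary = FALSE)         # FALSE
resolve_unique_table_names(FALSE, temporary = TRUE)   # explicit value wins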

Could either be fixed in #79 or in a future PR.

explanation why not key candidate

The tibbles produced by the functions cdm_enum_pk_candidates() and cdm_enum_fk_candidates() could include one more column explaining (by a code or verbosely) why the column was rejected as a key candidate.

Multi-joins

With cdm_join_tbl() we are able to join two tables of a dm that share a direct link (FK).

What if we want to join a further table from the same dm in a second step, etc.?

A solution would be a new function, which wraps cdm_join_tbl() and then returns the original dm, but with the result of the join as an additional table. This way, there would be the possibility of adding a secondary join to this table in the pipeline.

Questions would be:

  • what will the PK of the new table be (depends probably on the type of the join and the type of the relation between the joined tables)?
  • how to determine the FKs from and to the newly created table?

Define filtering operation

  • referential integrity of the result guaranteed
  • filter then flatten is the same as flatten with inner join then filter (?)

Document this in ?cdm_filter and in vignette("dm-filtering").

Redesign constructors and validators

  • dm() should be replaced by logic to extract pk/fk information from a datamodelr object
  • new_dm() becomes dm(), no longer accepts datamodelr object
  • new_dm2() becomes new_dm() -- a thin wrapper around tibble() after #57. Need to revisit interface carefully. Maybe extract patch_dm() to update parts of an existing dm?

Using git2rdata to store dm objects as plain text files suitable for version control

Interesting package.

@krlmlr and I had a chat at useR! 2019 about linking the dm package with the git2rdata package (https://ropensci.github.io/git2rdata/).

I would suggest that I send a PR that uses git2rdata to read and write a dm object to plain text files suitable for storage under version control. A bit similar to cdm_copy_to().

Here are a few ideas.

  • make cdm_copy_to() into an S3 generic with methods for src_dbi, git_repository and character (see the sketch after this list). The method for src_dbi is the current implementation. git_repository writes tables to a path in the root of a git repo. character assumes that the value is the path to the root of a project folder.
  • git2rdata writes the files into a subfolder of a repository. The user can specify this subfolder. Equivalent to selecting the database.
  • each table is stored as a git2rdata object (tsv + yml file), using the name of the table in the dm object
  • how to store the src component? Or could the dm class also accept a path or url as source?
  • the data model object seems to be in a YAML format, so we can store that as is, maybe under some protected name, to avoid a clash with a table of the same name.
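A rough sketch of the dispatch idea (not the actual dm implementation; the method bodies are only indicated):

cdm_copy_to <- function(dest, dm, ...) {
  UseMethod("cdm_copy_to")
}

cdm_copy_to.src_dbi <- function(dest, dm, ...) {
  # current behaviour: copy the tables into the database source
}

cdm_copy_to.git_repository <- function(dest, dm, path = "data", ...) {
  # write each table as a git2rdata object (tsv + yml) under `path`,
  # e.g. git2rdata::write_vc(tbl, file.path(path, name), root = dest)
}

cdm_copy_to.character <- function(dest, dm, ...) {
  # treat `dest` as the root of a project folder
}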

Remove disconnected tables

Goal: For a key column (primary/foreign), remove all references and all tables that become disconnected if the key column is removed.

Implement cdm_get_con()

Like cdm_get_src(), but returns a database connection (or an error if the source is a data frame or another non-database object).
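A minimal sketch of the idea, assuming the existing accessor cdm_get_src() (not the final implementation):

cdm_get_con <- function(dm) {
  src <- cdm_get_src(dm)
  if (!inherits(src, "src_dbi")) {
    stop("`dm` is not backed by a database source.", call. = FALSE)
  }
  src$con
}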

Provide way to check referential integrity

After b23b969, we need an easy way to check integrity of all pk and fk constraints, with an option to show where integrity is violated:

  • one function that returns a data frame with results of the integrity check
  • one function that throws an error if integrity is violated, calls the first function

What's a good name?
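A sketch of the two-function split with placeholder names (the naming question above is still open); the actual checking logic is only indicated:

library(tibble)

# Placeholder name: one row per declared pk/fk constraint; in a real
# implementation `is_valid` would be computed by comparing key values.
cdm_check_constraints <- function(dm) {
  tibble(
    table    = character(),
    kind     = character(),   # "PK" or "FK"
    is_valid = logical(),
    why      = character()    # details on violating rows
  )
}

# Placeholder name: calls the first function and errors on any violation.
cdm_check_constraints_or_stop <- function(dm) {
  res <- cdm_check_constraints(dm)
  if (!all(res$is_valid)) {
    stop("Referential integrity is violated; see the check results.", call. = FALSE)
  }
  invisible(dm)
}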

as_dm() for lists of remote tables

doesn't work yet:

library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(dm)

as_dm(list(a = memdb_frame(a = 1)))
#> Error in UseMethod("as.data_model"): no applicable method for 'as.data_model' applied to an object of class "NULL"

Created on 2019-08-13 by the reprex package (v0.3.0)

function `cdm_reset_filter()`

This function would remove either all filters or, better, let you remove the n-th filter, because you made a typo there and you don't want to remove all filters at once when the rest seems fine. Maybe two different functions for those two scenarios?

Distinguish between unit tests and integration tests

Each function/operation should have several unit tests and one or two integration tests.

Each R/*.R file should have a corresponding tests/testthat/test-*.R file. The tests in that file should touch (almost) all code in that *.R file. (How to verify? Can we easily collect this from covr? Perhaps by repeatedly running covr::package_coverage(type = "none", code = 'devtools::test(pattern = "^...$")').)

Implement for new tests, close this issue when all existing tests have been adapted.

unit tests

  • small examples on local data frames only
  • focusing on a specific behavior
  • easy to understand and to rerun in the console

integration tests

  • one larger complex-ish example
  • running on all sources including databases
  • known input, known output

Filter cascades create too complex SQL

With one filter, everything seems fine. When a second filter is added, the first filter is applied redundantly (via joins) for every table in the data model. This may lead to an overload of the query planner.

Potential solutions:

  • Add a new verb for applying referential integrity constraints, don't apply them yet after each filter
  • Materialize enumerations of IDs for each table
  • ?
library(tidyverse)
library(dm)

dm <-
  cdm_nycflights13() %>%
  cdm_select_tbl(-weather) %>%
  cdm_filter(flights, month == 1, day == 1) %>%
  cdm_copy_to(dbplyr::src_memdb(), .)

dm %>%
  cdm_filter(flights, year == 1) %>%
  cdm_get_tables() %>%
  map(dbplyr::sql_render)
#> $airlines
#> <SQL> SELECT * FROM `airlines` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> )
#> 
#> $airports
#> <SQL> SELECT * FROM `airports` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`faa` = `RHS`.`origin`)
#> )
#> 
#> $flights
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)
#> 
#> $planes
#> <SQL> SELECT * FROM `planes` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`tailnum` = `RHS`.`tailnum`)
#> )

dm %>%
  cdm_filter(flights, year == 1) %>%
  cdm_filter(airlines, name == "xyz") %>%
  cdm_get_tables() %>%
  map(dbplyr::sql_render)
#> $airlines
#> <SQL> SELECT *
#> FROM (SELECT * FROM `airlines` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> ))
#> WHERE (`name` = 'xyz')
#> 
#> $airports
#> <SQL> SELECT * FROM (SELECT * FROM `airports` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`faa` = `RHS`.`origin`)
#> )) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT * FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM (SELECT * FROM `airlines` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> ))
#> WHERE (`name` = 'xyz')) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> )) AS `RHS`
#>   WHERE (`LHS`.`faa` = `RHS`.`origin`)
#> )
#> 
#> $flights
#> <SQL> SELECT * FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM (SELECT * FROM `airlines` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> ))
#> WHERE (`name` = 'xyz')) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> )
#> 
#> $planes
#> <SQL> SELECT * FROM (SELECT * FROM `planes` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`tailnum` = `RHS`.`tailnum`)
#> )) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT * FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM (SELECT * FROM `airlines` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> ))
#> WHERE (`name` = 'xyz')) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> )) AS `RHS`
#>   WHERE (`LHS`.`tailnum` = `RHS`.`tailnum`)
#> )

Created on 2019-08-14 by the reprex package (v0.3.0)

Implement cdm_disambiguate_cols()

Goal: rename all columns that exist in more than one table, prefixing them with the table name and a separator ("." by default).

With optional verbose output (quiet = FALSE).

Should also take care of updating pk and fk.
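A minimal sketch of the renaming rule on a plain named list of tables (key bookkeeping and the quiet argument are omitted; the helper is hypothetical):

library(purrr)

disambiguate_cols <- function(tables, sep = ".") {
  all_cols <- unlist(map(tables, colnames))
  dupes <- unique(all_cols[duplicated(all_cols)])
  imap(tables, function(tbl, name) {
    hit <- colnames(tbl) %in% dupes
    colnames(tbl)[hit] <- paste(name, colnames(tbl)[hit], sep = sep)
    tbl
  })
}

disambiguate_cols(list(a = data.frame(id = 1, x = 2), b = data.frame(id = 3, y = 4)))
# `id` becomes `a.id` in table `a` and `b.id` in table `b`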

Implement cdm_add_tbl() and cdm_rm_tbl()

For adding, use def[0, ] as boilerplate for the new row in the def. Use vctrs::vec_rbind() for row-binding tibbles.
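A sketch of the row-binding step (the def column names `table` and `data` follow the internal-structure issue above and are not guaranteed to match the real implementation):

library(tibble)
library(vctrs)

add_tbl_to_def <- function(def, name, table) {
  new_row <- vec_rbind(
    def[0, ],                                 # zero-row template: all def columns
    tibble(table = name, data = list(table))  # the parts we know about
  )
  vec_rbind(def, new_row)
}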

Removing is implemented for symmetry; the documentation should point to cdm_select_tbl().

Use @rdname to document in the same .Rd file.

Rewrite all examples and documentation to use this new idiom.

additions to `dm_flatten_to_tbl()`

IMHO cdm_flatten_to_tbl() should feature:

  1. a logical parameter drop_key_cols, since we often do not want to keep the code-columns used as keys, but only the real info (then reunite_parent_child() might come in handy)
  2. the user should be informed by a message which tables are joined to the start-table

Simplify code for querying and updating dm objects

Currently, we always call new_dm() with a fully constructed datamodelr object. The purpose of #9 was to introduce a simpler internal data structure which now can be used for updating a dm.

We should offer a way to construct a dm object from internal components only, essentially extracting an internal constructor from new_dm(). Also, we should provide accessor functions for the new components. Then, all code can work on the internal data structure. This should simplify the codebase a lot.

https://github.com/krlmlr/dm/pull/9/files#r306247485

As a result, calls to cdm_get_data_model() should become very rare, needed mostly for cdm_draw() and friends.

https://github.com/krlmlr/dm/pull/9/files#r306248024

Extract expect_cdm_error

and replace everywhere:

expect_cdm_error <- function(expr, class) {
  expect_error(expr, cdm_error(class))
}

Visualisation of column acting as foreign key more than once

If a column is a foreign key more than once, it also appears in the visualisation more than once. It should only occur once.

library(dm)
library(tibble)
a <- tibble(a = 1:3)
b <- tibble(b = 1:5)
c <- tibble(c = 1:4)

as_dm(list(a = a, b = b, c = c)) %>% 
  cdm_add_pk(b, b) %>% 
  cdm_add_pk(c, c) %>% 
  cdm_add_fk(a, a, b) %>% 
  cdm_add_fk(a, a, c) %>% cdm_draw()

Created on 2019-08-27 by the reprex package (v0.3.0)

New operation: disentangle

dm in, dm out.

Starting from a table, create copies of all tables that can be reached in more than one way in the relation graph.

Example: flights -> airports has two paths; this would create two identical copies, origin.airports and dest.airports (or origin and dest), in the resulting dm, breaking the cycle.

Need to think about how this can work with deep hierarchies.

Implementation (see the sketch below):

  • dfs or bfs
  • compute induced subgraph by the nodes visited
  • nodes with more than one incoming edge in the induced subgraph must be duplicated
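A sketch of the graph part using igraph (the dm bookkeeping around it is omitted):

library(igraph)

# Edges point from child to parent (FK direction), as in the flights example:
# `origin` and `dest` both reference `airports`.
edges <- data.frame(
  from = c("flights", "flights"),
  to   = c("airports", "airports")
)
g <- graph_from_data_frame(edges)

start <- "flights"
reached <- subcomponent(g, start, mode = "out")   # nodes reachable from the start table
sub <- induced_subgraph(g, reached)               # induced subgraph on the visited nodes

# Nodes with more than one incoming edge in the induced subgraph must be duplicated.
V(sub)$name[degree(sub, mode = "in") > 1]
#> [1] "airports"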
