
cynkra / dm

494 stars · 10 watchers · 50 forks · 53.6 MB

Working with relational data models in R

Home Page: https://dm.cynkra.com

License: Other

Languages: R 97.96%, Makefile 0.29%, CSS 0.52%, JavaScript 0.96%, HTML 0.20%, Dockerfile 0.08%
Topics: r, relational-databases, data-model, dbplyr, dbi, datawarehousing, data-warehousing


dm's Issues

Implement cdm_select()

If the user tries to remove a key column (pk/fk), a message should be given, similar to:

library(tidyverse)
iris %>%
  group_by(Species) %>%
  select(-Species)
#> Adding missing grouping variables: `Species`
#> # A tibble: 150 x 5
#> # Groups:   Species [3]
#>    Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#>    <fct>          <dbl>       <dbl>        <dbl>       <dbl>
#>  1 setosa           5.1         3.5          1.4         0.2
#>  2 setosa           4.9         3            1.4         0.2
#>  3 setosa           4.7         3.2          1.3         0.2
#>  4 setosa           4.6         3.1          1.5         0.2
#>  5 setosa           5           3.6          1.4         0.2
#>  6 setosa           5.4         3.9          1.7         0.4
#>  7 setosa           4.6         3.4          1.4         0.3
#>  8 setosa           5           3.4          1.5         0.2
#>  9 setosa           4.4         2.9          1.4         0.2
#> 10 setosa           4.9         3.1          1.5         0.1
#> # … with 140 more rows

Created on 2019-09-06 by the reprex package (v0.3.0)

Also consider a prune option for cdm_select() (later!) that removes all related tables if a key is removed (with #44).

activate()-style API

With #57 and part of #54 we're now ready to start looking into an API that extracts a table from a dm but keeps the context, so that it can be put back.

dm %>%
  activate(flights) %>%
  filter(hour < 6) %>%
  ...

How do we put back a table?

  • Overwrite an existing table, removing all obsolete keys
  • Create a new table and add primary and foreign keys where we know that they haven't been altered

How are the verbs called?

We temporarily remove a piece of a mosaic and reinsert it, or perhaps insert a replica. Is reinsertion different from adding a new version? Maybe zoom_to_tbl(), update_tbl(), and insert_tbl()?
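A rough sketch of how these verbs could chain, using the candidate names above (nothing here is implemented yet; names and semantics are still open):

library(dm)
library(dplyr)

cdm_nycflights13() %>%
  zoom_to_tbl(flights) %>%   # extract `flights` but keep the dm context
  filter(hour < 6) %>%       # ordinary dplyr verbs on the zoomed table
  update_tbl()               # put the table back, keeping keys we know are unchanged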

Roadmap

  • zoom_to_tbl() and update_tbl(), with print() and format() method
  • select.dm() and rename.dm(), with a nice error message if not activated
  • filter.dm()
  • mutate.dm() and transmute.dm()
  • summarise.dm()
  • group_by.dm(), ungroup.dm()

Independently:

  • cdm_add_tbl() (useful for #77)
  • insert_tbl()

cdm_draw is duplicating a column name

Reproducible code (R and all CRAN packages up to date, dm up to date)

dm0 <- structure(list(src = structure(list(tbl_f = function (data) 
{
    as_tibble(data, .name_repair = "check_unique")
}, name = "<environment: 0x000001b8d0e65a70>", env = <environment>), class = c("src_local", 
"src")), tables = list(exceptions = structure(list(site = character(0), 
    onprem_id = numeric(0), exception_message = character(0)), row.names = integer(0), class = c("tbl_df", 
"tbl", "data.frame")), sam1_flight = structure(list(onprem_id = numeric(0), 
    station = character(0), received_at = structure(numeric(0), tzone = "", class = c("POSIXct", 
    "POSIXt")), data = character(0)), row.names = integer(0), class = c("tbl_df", 
"tbl", "data.frame")), sam26_event = structure(list(onprem_id = numeric(0), 
    station = character(0), received_at = structure(numeric(0), tzone = "", class = c("POSIXct", 
    "POSIXt")), data = character(0)), row.names = integer(0), class = c("tbl_df", 
"tbl", "data.frame"))), data_model_tables = structure(list(table = c("exceptions", 
"sam1_flight", "sam26_event"), segment = c(NA, NA, NA), display = c("accent4", 
"accent2", "accent2")), row.names = c(3L, 12L, 13L), class = "data.frame"), 
    data_model_pks = structure(list(column = c("onprem_id", "onprem_id"
    ), table = c("sam1_flight", "sam26_event")), row.names = c(NA, 
    -2L), class = c("tbl_df", "tbl", "data.frame")), data_model_fks = structure(list(
        table = c("exceptions", "exceptions"), column = c("onprem_id", 
        "onprem_id"), ref = c("sam26_event", "sam1_flight"), 
        ref_col = c("onprem_id", "onprem_id")), row.names = c(NA, 
    -2L), class = c("tbl_df", "tbl", "data.frame")), filter = NULL), class = "dm")
  
###############

cdm_draw(dm0[c("exceptions","sam1_flight")],view_type = "all")
cdm_draw(dm0,view_type = "all")

The first cdm_draw() call prints what I expect:

[screenshot: expected diagram]

With the second call, the column onprem_id is duplicated:

[screenshot: diagram with onprem_id shown twice]

decompose_table() with user-specified keys

decompose_table() should work when the user provides a custom key that is unique. Currently:

library(dplyr)
library(dm)

nycflights13::planes %>% 
  add_count(model, engines, name = "n_eng") %>% 
  add_count(model, seats, name = "n_seats") %>% 
  add_count(model, manufacturer, name = "n_manufacturer") %>% 
  mutate(model_id = paste0(
    model,
    if_else(n_eng > 1, paste0("-", engines), ""),
    if_else(n_seats > 1, paste0("-", seats), ""),
    if_else(n_manufacturer > 1, paste0("-", abbreviate(manufacturer)), "")
  )) %>% 
  decompose_table(model_id, model, manufacturer, type, engines, seats, manufacturer)
#> `new_id_column` can not have an identical name as one of the columns of
#> `.`.

Created on 2019-09-24 by the reprex package (v0.3.0)

A more helpful error message when using numeric indices on dm objects?

Thanks for the fantastic package.

I think the following error could be improved:

library(dm)
library(dplyr,warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 3.6.1
dm0 <- as_dm(lst(iris,cars))
dm0[[1]]
#> Can't convert a double vector to a string

Created on 2019-07-31 by the reprex package (v0.3.0)

dm:::`[[.dm`
#> function (x, name) 
#> {
#>     table <- as_string(name)
#>     tbl(x, table)
#> }

My understanding is that dm objects are lists rather than environments because you want to pass them by copy, but at the same time you'd rather not allow numeric indices because order doesn't make sense in a DB.

In that case I would suggest something like: "Can't subset a dm object by position; either subset by name or use cdm_get_tables() to convert to a regular list first."


Note that the choice not to allow numeric indices makes the operation map(dm0, nrow) fail, which might be unexpected for a user who previously had success running nrow(dm0[["iris"]]) and nrow(dm0[["cars"]]).
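A workaround exists today via cdm_get_tables(); a quick sketch, assuming purrr for the iteration:

library(dm)
library(dplyr)
library(purrr)

dm0 <- as_dm(lst(iris, cars))

# Convert to a plain named list first, then iterate over the tables.
dm0 %>%
  cdm_get_tables() %>%
  map_int(nrow)
# -> named integer vector: iris 150, cars 50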

Feature request: accept NULL arguments in `cdm_add_fk()` to guess.

In my use case I would like to define only my primary keys manually, then run a command on my data model and have all obvious foreign keys set up appropriately.

I believe many models are fairly regular, so having a feature to guess the keys can be pretty awesome; for me it would spare a fair number of lines of code and would mean I don't have to update my code much when adding a table.

I believe this could be achieved by allowing NULL arguments in cdm_add_fk().

Standard use:

cdm_add_fk(my_dm, my_table, my_column, my_ref_table, check = TRUE) 

Look for my_column in all tables from my_dm, and set them as foreign keys linked to my_ref_table's primary key:

cdm_add_fk(my_dm, NULL, my_column, my_ref_table, check = TRUE) 

Look for any column in my_table named like the primary keys of my_ref_table, and make them foreign keys:

cdm_add_fk(my_dm, my_table, NULL, my_ref_table, check = TRUE) 

Look for any column in any table of my_dm named like the primary keys of my_ref_table, and make them foreign keys:

cdm_add_fk(my_dm, NULL, NULL, my_ref_table, check = TRUE)

Look for a primary key on my_column in any table of my_dm and build the appropriate foreign key.

cdm_add_fk(my_dm, my_table, my_column, NULL, check = TRUE) 

Look for a primary key on my_column in any table of my_dm and build the appropriate foreign keys on all columns of other tables containing this column.

cdm_add_fk(my_dm, NULL, my_column, NULL, check = TRUE) 

Look for all primary keys, see if any of them matches the columns of my_table by name, and build the appropriate foreign keys.

cdm_add_fk(my_dm, my_table, NULL, NULL, check = TRUE) 

Look for all primary keys, see if any of them matches the columns of all other tables by name, and build the appropriate foreign keys.

cdm_add_fk(my_dm, NULL, NULL, NULL, check = TRUE) 

check = TRUE could either trigger a failure in case of incompatible value subsets, or issue an explicit warning and not define the ineligible foreign keys. (check = NA might be used to choose between these.)

Improve internal data structure again

To make it easy to maintain internal consistency, I think the internal data structure needs to be a tibble with the following columns (see the sketch at the end of this issue):

  • Table name (character)
  • Table object (data frame/tbl/...) (list)
  • Color (character)
  • Filter expressions (list)
  • Primary keys (list of character)
  • Child tables (inverse of foreign keys; list of tibbles of the form "table name + list of character")

With this, it seems that we only need two "restore" functions:

  • Fix table names after a table select/rename
  • Fix column names after a column select/rename

These "restore" functions accept a tidyselect-like recipe that describes how to handle renames. These can be passed on to dplyr::recode() to implement the actual recoding.
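For concreteness, a minimal sketch of what such a definition tibble could look like for a two-table dm; the column names here are illustrative, not final:

library(tibble)

def <- tibble(
  table = c("airlines", "flights"),
  data = list(nycflights13::airlines, nycflights13::flights),
  display = c("blue", "orange"),                     # color
  filters = list(list(), list(quote(month == 1))),   # filter expressions
  pks = list("carrier", character()),                # primary keys
  fks = list(                                        # child tables (inverse FKs)
    tibble(child_table = "flights", child_cols = list("carrier")),
    tibble(child_table = character(), child_cols = list())
  )
)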

`dm_repair_constraints()` (OP: Supporting relationships with weaker constraints)

Is it possible, perhaps outside of the dm ecosystem, to model relationships between tables that aren't based on FK-PK relationships? Think of a situation where there is a flights and an airports table, where some airport codes may not exist in the flights table (or the other way around). These would definitely be messy data relationships, but they do occur fairly frequently in the data I work with. My current approach is to write lots of dplyr join statements (and try to remember the relationships) or to build views in a lightweight system like SQLite and query the views. It would be nice to be able to codify these relationships directly within R.

Thanks for the time and the package!

Stricter logic for `cdm_flatten_to_tbl()`

The function itself should not think too much; it should apply the same logic for all join methods. Furthermore, only tables that are (direct) parents will be allowed for this function (no grandparents etc.). If the user passes tables in the ellipsis, the joins will happen in this order (a usage sketch follows at the end of this issue).

  1. left_join(), inner_join(), full_join(): if no tables are given in the ellipsis, the result will depend on the order of the tables in the (relevant part of the) dm, but this only affects the column order.
  2. semi_join(): the result will be exactly those rows that are referenced by ALL parent tables.
  3. anti_join(): the result will be those rows that are NOT referenced by ANY parent table (i.e. referenced by NO parent table). It might be more interesting for the user to have a function telling them which rows are not referenced by at least one parent table, but that needs to be another function.
  4. nest_join(): the created columns should have the names of the RHS tables of the join. Again, if the ellipsis is empty, the order of those tables in the dm determines the resulting column order.
  5. right_join(): if the ellipsis is empty, the number of rows might depend on the table that comes last in the dm table list (if referential integrity is not given). In this case, a warning is issued.

There will be a further function cdm_flatten_deep_to_tbl(), which works only for left_join(), inner_join() and full_join(), and which also allows for deeper table hierarchies (grandparents etc.).
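For illustration, a hypothetical call under these rules (the join argument is assumed to work as in cdm_join_to_tbl()):

library(dm)
library(dplyr)

# `airlines` and `planes` are direct parents of `flights`; listing them in the
# ellipsis fixes the join order, and the chosen join verb applies uniformly.
cdm_nycflights13(cycle = FALSE) %>%
  cdm_flatten_to_tbl(flights, airlines, planes, join = left_join)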

Clarify error message when joining tables with cycles

While working through the README and vignettes to learn this package (it's fantastic by the way, thank you!) I ran into this error:

dm_nycflights <- cdm_nycflights13(cycle = TRUE)

# ... testing and exploration ...

dm_nycflights %>%
  cdm_join_to_tbl(airports, flights, join = semi_join)
## Error: `table` must be a string.
## Call `rlang::last_error()` to see a backtrace

It took me a while to figure out that the issue was cycle = TRUE and not the format of the table arguments. Finally, I realized the lines I was supposed to run needed cycle = FALSE.

Maybe cdm_join_to_tbl() could provide a better hint in this case?

As a related suggestion, the README mentions that cdm_join_to_tbl() requires cycle-free relationships, but it might be helpful to explain why this is the case.

Support for compound keys

Are there any plans/thoughts on supporting compound keys? I frequently have a table where the primary key is a combination of fields. It looks like putting a PK constraint on something like that in dm requires creating a new single column that is a mutation of the two non-unique components.
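For reference, a minimal sketch of the current workaround mentioned above (collapsing the component columns into a single surrogate key before declaring the primary key); the example data is made up:

library(dplyr)
library(tibble)
library(dm)

# Toy table whose natural key is the combination (site, year).
readings <- tibble(
  site  = c("a", "a", "b"),
  year  = c(2018, 2019, 2018),
  value = c(1.2, 3.4, 5.6)
) %>%
  mutate(site_year = paste(site, year, sep = "_"))  # surrogate single-column key

as_dm(list(readings = readings)) %>%
  cdm_add_pk(readings, site_year)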

Thanks for the package!

Avoid using cdm_get_data_model() where possible

We now have a much better internal structure available through cdm_get_def() etc. Remove things related to new_dm() and new_dm2(); prefer new_dm3() over new_dm2() or even new_dm().

  • selection of tables
  • selection of columns (with cdm_restore())
  • primary keys
  • foreign keys
  • filtering
  • collect and copy_to
  • disambiguate
  • print
  • tests and helpers
  • coverage test to remove dead code
  • learn

Schema support

Required for copying to and learning from a database.

Options:

  1. Rely on "default schema" (e.g. Oracle) or "search path" (e.g. in Postgres) -- many but not all DBMS know this concept
  2. Pass in an optional schema qualifier in copy/learn operations

The first option is simpler, but it looks like we need to take care of the schema identifier ourselves.

Let's add a schema argument which takes a DBI::Id object.
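A hypothetical interface sketch, assuming cdm_copy_to() gains such a schema argument (the argument name and behaviour are not final):

library(DBI)
library(dm)

con <- dbConnect(RPostgres::Postgres(), dbname = "analytics")

cdm_copy_to(
  con,
  cdm_nycflights13(),
  schema = Id(schema = "staging"),  # hypothetical argument taking a DBI::Id
  temporary = FALSE
)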

Ideally, cdm_learn_from_db() would be part of the dm() constructor -- later.

Handling of filter conditions for `cdm_get_tables()`

cdm_get_tables() currently ignores filter conditions. When using cdm_get_tables() while filter conditions are set, we should do one of:

  1. give a warning message
  2. throw an error
  3. return the filtered tables

In the third case (which I think is the approach we should probably choose), we would need a new function that is implemented like cdm_get_tables() currently is, and use it wherever possible, because in many places we only need, for example, the column names and don't care whether filters have been applied.

Mutating from different tables

Hi all,

I think that {dm} offers great prospects for those handling complex datasets. In my specific case, I would love to be able to use {dm} to handle a dozen locally stored interrelated tables, instead of having to perform endless joins and duplicating the information multiple times. It is not so much to save memory as to make manipulation more natural and safer.

Yet, not coming from the relational database side of things, but having instead always used data frames in R, I see one critical obstacle preventing me from using your otherwise great package: the impossibility of performing mutation (sensu dplyr::mutate, dplyr::transmute), even for locally stored data (if I missed it, very sorry).

I would like to share some thoughts about that in case you would like to consider the idea.

  • IMO the best from a user perspective would probably be to have a dplyr::mutate-like function. The issue is that ideally a mutation could depend on variables located in different tables (with the usual restriction of source vectors being the same length or of length one). In this context, it is not clear how to create a slick syntax for that, but perhaps a dedicated operator could help.

Here is a made up example (probably stupid from an aviation perspective) of what it could look like:

fuel_per_mile_per_engine <- 2.5 # in gallons
cdm_mutate(
   data = dm::cdm_nycflights13(),
   table = flights,
   fuel = distance %from% flights * engines %from% planes * fuel_per_mile_per_engine
)

table would refer to where to insert the new column, fuel = xxx would feed the ... argument, and the %from% (or %$%?) operator would quote both the lhs and rhs to evaluate them in the right context. If the column names are unique across tables, we could even imagine that the %from% tbl part could be guessed by cdm_mutate().

  • a perhaps easier (but not necessarily alternative) solution would be to implement a cdm_update_tbl(), which would be similar to cdm_add_tbl() but would allow updating an existing table. The idea is that one would use tbl() to extract the required tables, do the mutation steps (and joins if required) using {dplyr} directly, and then put the table back into the dm object using cdm_update_tbl(). Contrary to cdm_rm_tbl() + cdm_add_tbl(), cdm_update_tbl() would have the benefit of not losing the definitions of the primary and foreign keys in the process.

I know this represents some work and that you must set priorities, but these would be very welcome feature additions.

Feel free to ignore that and close the issue if you think I am missing the point.

++

Alex

corner case handling of 'adapt_fk_cols()'

Need to consider cases where:

  1. multiple tables have pk-columns of the same name and different fk-columns pointing to them from the same table.
  2. two foreign key columns from one table point to one and the same pk-column of another table

Implement cdm_rename()

Required for #40. Same syntax as cdm_filter().
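A hypothetical usage sketch, mirroring cdm_filter()'s table-first syntax (the interface is not final):

library(dm)
library(dplyr)

cdm_nycflights13() %>%
  cdm_rename(airports, airport_name = name)  # rename `name` inside `airports`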

cdm_select() is similar but has the additional complexity that removed columns need to be taken care of. Later.

Default settings for `cdm_copy_to()`

The default parameter settings for cdm_copy_to() should produce temporary tables with unique table names.

The idea is to have the default unique_table_names = NULL, which becomes TRUE when temporary = TRUE and FALSE otherwise.
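A minimal sketch of that default resolution (the helper is hypothetical; the argument names follow this issue):

resolve_unique_table_names <- function(unique_table_names = NULL, temporary = TRUE) {
  if (is.null(unique_table_names)) {
    unique_table_names <- temporary
  }
  unique_table_names
}

resolve_unique_table_names()                          # TRUE (temporary default)
resolve_unique_table_names(temporary = FALSE)         # FALSE
resolve_unique_table_names(FALSE, temporary = TRUE)   # explicit value wins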

Could either be fixed in #79 or in a future PR.

explanation why not key candidate

The tibbles produced by the functions cdm_enum_pk_candidates() and cdm_enum_fk_candidates() could include one more column explaining (by a code or verbosely) why the column was rejected as a key candidate.

Multi-joins

With cdm_join_tbl() we are able to join two tables of a dm that share a direct link (FK).

What if we want to join a further table from the same dm in a second step, etc.?

A solution would be a new function, which wraps cdm_join_tbl() and then returns the original dm, but with the result of the join as an additional table. This way, there would be the possibility of adding a secondary join to this table in the pipeline.

Questions would be:

  • what will the PK of the new table be (depends probably on the type of the join and the type of the relation between the joined tables)?
  • how to determine the FKs from and to the newly created table?

Define filtering operation

  • referential integrity of the result guaranteed
  • filter then flatten is the same as flatten with inner join then filter (?)

Document this in ?cdm_filter and in vignette("dm-filtering").

Redesign constructors and validators

  • dm() should be replaced by logic to extract pk/fk information from a datamodelr object
  • new_dm() becomes dm(), no longer accepts datamodelr object
  • new_dm2() becomes new_dm() -- a thin wrapper around tibble() after #57. Need to revisit interface carefully. Maybe extract patch_dm() to update parts of an existing dm?

Using git2rdata to store dm objects as plain text files suitable for version control

Interesting package.

@krlmlr and I had a chat at useR! 2019 about linking the dm package with the git2rdata package (https://ropensci.github.io/git2rdata/).

I would suggest that I send a PR that uses git2rdata to read and write a dm object to plain text files suitable for storage under version control. A bit similar to cdm_copy_to().

Here are a few ideas.

  • make cdm_copy_to() into an S3 generic with methods for src_dbi, git_repository and character (see the sketch after this list). The method for src_dbi is the current implementation. git_repository writes tables to a path in the root of a git repo. character assumes that the value is the path to the root of a project folder.
  • git2rdata writes the files into a subfolder of a repository. The user can specify this subfolder. Equivalent to selecting the database.
  • each table is stored as a git2rdata object (tsv + yml file), using the name of the table in the dm object
  • how to store the src component? Or could the dm class also accept a path or url as source?
  • the data model object seems to be in a YAML format, so we can store that as is, maybe under some protected name, to avoid a clash with a table of the same name.
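A rough sketch of the dispatch idea (not the actual dm implementation; the method bodies are only indicated):

cdm_copy_to <- function(dest, dm, ...) {
  UseMethod("cdm_copy_to")
}

cdm_copy_to.src_dbi <- function(dest, dm, ...) {
  # current behaviour: copy the tables into the database source
}

cdm_copy_to.git_repository <- function(dest, dm, path = "data", ...) {
  # write each table as a git2rdata object (tsv + yml) under `path`,
  # e.g. git2rdata::write_vc(tbl, file.path(path, name), root = dest)
}

cdm_copy_to.character <- function(dest, dm, ...) {
  # treat `dest` as the root of a project folder
}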

Remove disconnected tables

Goal: For a key column (primary/foreign), remove all references and all tables that become disconnected if the key column is removed.

Implement cdm_get_con()

Like cdm_get_src(), but returns a database connection (or an error if the source is a data frame or another non-database object).
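A minimal sketch of the idea, assuming the existing accessor cdm_get_src() (not the final implementation):

cdm_get_con <- function(dm) {
  src <- cdm_get_src(dm)
  if (!inherits(src, "src_dbi")) {
    stop("`dm` is not backed by a database source.", call. = FALSE)
  }
  src$con
}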

Provide way to check referential integrity

After b23b969, we need an easy way to check integrity of all pk and fk constraints, with an option to show where integrity is violated:

  • one function that returns a data frame with results of the integrity check
  • one function that throws an error if integrity is violated, calls the first function

What's a good name?
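A sketch of the two-function split with placeholder names (the naming question above is still open); the actual checking logic is only indicated:

library(tibble)

# Placeholder name: one row per declared pk/fk constraint; in a real
# implementation `is_valid` would be computed by comparing key values.
cdm_check_constraints <- function(dm) {
  tibble(
    table    = character(),
    kind     = character(),   # "PK" or "FK"
    is_valid = logical(),
    why      = character()    # details on violating rows
  )
}

# Placeholder name: calls the first function and errors on any violation.
cdm_check_constraints_or_stop <- function(dm) {
  res <- cdm_check_constraints(dm)
  if (!all(res$is_valid)) {
    stop("Referential integrity is violated; see the check results.", call. = FALSE)
  }
  invisible(dm)
}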

as_dm() for lists of remote tables

doesn't work yet:

library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(dm)

as_dm(list(a = memdb_frame(a = 1)))
#> Error in UseMethod("as.data_model"): no applicable method for 'as.data_model' applied to an object of class "NULL"

Created on 2019-08-13 by the reprex package (v0.3.0)

function `cdm_reset_filter()`

This function would remove either all filters or, better, let you remove the n-th filter, because you made a typo there and you don't want to remove all filters at once when the rest seems fine. Maybe two different functions for those two scenarios?

Distinguish between unit tests and integration tests

Each function/operation should have several unit tests and one or two integration tests.

Each R/*.R file should have a corresponding tests/testthat/test-*.R file. The tests in that file should touch (almost) all code in that *.R file. (How to verify? Can we easily collect this from covr? Perhaps by repeatedly running covr::package_coverage(type = "none", code = 'devtools::test(pattern = "^...$")').)

Implement for new tests, close this issue when all existing tests have been adapted.

unit tests

  • small examples on local data frames only
  • focusing on a specific behavior
  • easy to understand and to rerun in the console

integration tests

  • one larger complex-ish example
  • running on all sources including databases
  • known input, known output

Filter cascades create too complex SQL

With one filter, everything seems fine. When a second filter is added, the first filter is applied redundantly (via joins) for every table in the data model. This may lead to an overload of the query planner.

Potential solutions:

  • Add a new verb for applying referential integrity constraints, don't apply them yet after each filter
  • Materialize enumerations of IDs for each table
  • ?
library(tidyverse)
library(dm)

dm <-
  cdm_nycflights13() %>%
  cdm_select_tbl(-weather) %>%
  cdm_filter(flights, month == 1, day == 1) %>%
  cdm_copy_to(dbplyr::src_memdb(), .)

dm %>%
  cdm_filter(flights, year == 1) %>%
  cdm_get_tables() %>%
  map(dbplyr::sql_render)
#> $airlines
#> <SQL> SELECT * FROM `airlines` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> )
#> 
#> $airports
#> <SQL> SELECT * FROM `airports` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`faa` = `RHS`.`origin`)
#> )
#> 
#> $flights
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)
#> 
#> $planes
#> <SQL> SELECT * FROM `planes` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`tailnum` = `RHS`.`tailnum`)
#> )

dm %>%
  cdm_filter(flights, year == 1) %>%
  cdm_filter(airlines, name == "xyz") %>%
  cdm_get_tables() %>%
  map(dbplyr::sql_render)
#> $airlines
#> <SQL> SELECT *
#> FROM (SELECT * FROM `airlines` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> ))
#> WHERE (`name` = 'xyz')
#> 
#> $airports
#> <SQL> SELECT * FROM (SELECT * FROM `airports` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`faa` = `RHS`.`origin`)
#> )) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT * FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM (SELECT * FROM `airlines` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> ))
#> WHERE (`name` = 'xyz')) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> )) AS `RHS`
#>   WHERE (`LHS`.`faa` = `RHS`.`origin`)
#> )
#> 
#> $flights
#> <SQL> SELECT * FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM (SELECT * FROM `airlines` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> ))
#> WHERE (`name` = 'xyz')) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> )
#> 
#> $planes
#> <SQL> SELECT * FROM (SELECT * FROM `planes` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`tailnum` = `RHS`.`tailnum`)
#> )) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT * FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM (SELECT * FROM `airlines` AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM `flights`
#> WHERE (`year` = 1.0)) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> ))
#> WHERE (`name` = 'xyz')) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> )) AS `RHS`
#>   WHERE (`LHS`.`tailnum` = `RHS`.`tailnum`)
#> )

Created on 2019-08-14 by the reprex package (v0.3.0)

Implement cdm_disambiguate_cols()

Goal: rename all columns that exist in more than one table, prefixing them with the table name and a separator ("." by default).

With optional verbose output (quiet = FALSE).

Should also take care of updating pk and fk.
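A minimal sketch of the renaming rule on a plain named list of tables (key bookkeeping and the quiet argument are omitted; the helper is hypothetical):

library(purrr)

disambiguate_cols <- function(tables, sep = ".") {
  all_cols <- unlist(map(tables, colnames))
  dupes <- unique(all_cols[duplicated(all_cols)])
  imap(tables, function(tbl, name) {
    hit <- colnames(tbl) %in% dupes
    colnames(tbl)[hit] <- paste(name, colnames(tbl)[hit], sep = sep)
    tbl
  })
}

disambiguate_cols(list(a = data.frame(id = 1, x = 2), b = data.frame(id = 3, y = 4)))
# `id` becomes `a.id` in table `a` and `b.id` in table `b`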

Implement cdm_add_tbl() and cdm_rm_tbl()

For adding, use def[0, ] as boilerplate for the new row in the def. Use vctrs::vec_rbind() for row-binding tibbles.
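A sketch of the row-binding step (the def column names `table` and `data` follow the internal-structure issue above and are not guaranteed to match the real implementation):

library(tibble)
library(vctrs)

add_tbl_to_def <- function(def, name, table) {
  new_row <- vec_rbind(
    def[0, ],                                 # zero-row template: all def columns
    tibble(table = name, data = list(table))  # the parts we know about
  )
  vec_rbind(def, new_row)
}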

Removing is implemented for symmetry; the documentation should point to cdm_select_tbl().

Use @rdname to document in the same .Rd file.

Rewrite all examples and documentation to use this new idiom.

additions to `dm_flatten_to_tbl()`

IMHO cdm_flatten_to_tbl() should feature:

  1. a logical parameter drop_key_cols, since we often do not want to keep the code-columns used as keys, but only the real info (then reunite_parent_child() might come in handy)
  2. the user should be informed by a message which tables are joined to the start-table

Simplify code for querying and updating dm objects

Currently, we always call new_dm() with a fully constructed datamodelr object. The purpose of #9 was to introduce a simpler internal data structure which now can be used for updating a dm.

We should offer a way to construct a dm object from internal components only, essentially extracting an internal constructor from new_dm(). Also, we should provide accessor functions for the new components. Then, all code can work on the internal data structure. This should simplify the codebase a lot.

https://github.com/krlmlr/dm/pull/9/files#r306247485

As a result, calls to cdm_get_data_model() should become very rare, needed mostly for cdm_draw() and friends.

https://github.com/krlmlr/dm/pull/9/files#r306248024

Extract expect_cdm_error

and replace everywhere:

expect_cdm_error <- function(expr, class) {
  expect_error(expr, cdm_error(class))
}

Visualisation of column acting as foreign key more than once

If a column is a foreign key more than once, it also appears in the visualisation more than once. It should only occur once.

library(dm)
library(tibble)
a <- tibble(a = 1:3)
b <- tibble(b = 1:5)
c <- tibble(c = 1:4)

as_dm(list(a = a, b = b, c = c)) %>% 
  cdm_add_pk(b, b) %>% 
  cdm_add_pk(c, c) %>% 
  cdm_add_fk(a, a, b) %>% 
  cdm_add_fk(a, a, c) %>% cdm_draw()

Created on 2019-08-27 by the reprex package (v0.3.0)

New operation: disentangle

dm in, dm out.

Starting from a table, create copies of all tables that can be reached in more than one way in the relation graph.

Example: flights -> airports has two paths; this would create two identical copies, origin.airports and dest.airports (or origin and dest), in the resulting dm, breaking the cycle.

Need to think about how this can work with deep hierarchies.

Implementation (see the sketch below):

  • dfs or bfs
  • compute induced subgraph by the nodes visited
  • nodes with more than one incoming edge in the induced subgraph must be duplicated
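A sketch of the graph part using igraph (the dm bookkeeping around it is omitted):

library(igraph)

# Edges point from child to parent (FK direction), as in the flights example:
# `origin` and `dest` both reference `airports`.
edges <- data.frame(
  from = c("flights", "flights"),
  to   = c("airports", "airports")
)
g <- graph_from_data_frame(edges)

start <- "flights"
reached <- subcomponent(g, start, mode = "out")   # nodes reachable from the start table
sub <- induced_subgraph(g, reached)               # induced subgraph on the visited nodes

# Nodes with more than one incoming edge in the induced subgraph must be duplicated.
V(sub)$name[degree(sub, mode = "in") > 1]
#> [1] "airports"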
