ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing

Home Page: https://docs.ropensci.org/drake

License: GNU General Public License v3.0


drake's Introduction


drake is superseded. Consider targets instead.

As of 2021-01-21, drake is superseded. The targets R package is the long-term successor of drake, and it is more robust and easier to use. Please visit https://books.ropensci.org/targets/drake.html for full context and advice on transitioning.


Data analysis can be slow. A round of scientific computation can take several minutes, hours, or even days to complete. After it finishes, if you update your code or data, your hard-earned results may no longer be valid. How much of that valuable output can you keep, and how much do you need to update? How much runtime must you endure all over again?

For projects in R, the drake package can help. It analyzes your workflow, skips steps with up-to-date results, and orchestrates the rest with optional distributed computing. At the end, drake provides evidence that your results match the underlying code and data, which increases your ability to trust your research.

Video

That Feeling of Workflowing (Miles McBain)


(By Miles McBain; venue, resources)

rOpenSci Community Call


(resources)

What gets done stays done.

Too many data science projects follow a Sisyphean loop:

  1. Launch the code.
  2. Wait while it runs.
  3. Discover an issue.
  4. Rerun from scratch.

For projects with long runtimes, this process gets tedious. But with drake, you can automatically

  1. Launch the parts that changed since last time.
  2. Skip the rest.

How it works

To set up a project, load your packages,

library(drake)
library(dplyr)
library(ggplot2)
library(tidyr)
#> 
#> Attaching package: 'tidyr'
#> The following objects are masked from 'package:drake':
#> 
#>     expand, gather

load your custom functions,

create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone)) +
    theme_gray(24)
}

check any supporting files (optional),

# Get the files with drake_example("main").
file.exists("raw_data.xlsx")
#> [1] TRUE
file.exists("report.Rmd")
#> [1] TRUE

and plan what you are going to do.

plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))),
  hist = create_plot(data),
  fit = lm(Ozone ~ Wind + Temp, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

plan
#> # A tibble: 5 x 2
#>   target   command                                                              
#>   <chr>    <expr_lst>                                                           
#> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx"))                        …
#> 2 data     raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TR…
#> 3 hist     create_plot(data)                                                   …
#> 4 fit      lm(Ozone ~ Wind + Temp, data)                                       …
#> 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file = file_out("re…

So far, we have just been setting the stage. Use make() or r_make() to do the real work. Targets are built in the correct order regardless of the row order of plan.

make(plan) # See also r_make().
#> ▶ target raw_data
#> ▶ target data
#> ▶ target fit
#> ▶ target hist
#> ▶ target report

Except for files like report.html, your output is stored in a hidden .drake/ folder. Reading it back is easy.

readd(data) # See also loadd().
#> # A tibble: 153 x 6
#>    Ozone Solar.R  Wind  Temp Month   Day
#>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  41       190   7.4    67     5     1
#>  2  36       118   8      72     5     2
#>  3  12       149  12.6    74     5     3
#>  4  18       313  11.5    62     5     4
#>  5  42.1      NA  14.3    56     5     5
#>  6  28        NA  14.9    66     5     6
#>  7  23       299   8.6    65     5     7
#>  8  19        99  13.8    59     5     8
#>  9   8        19  20.1    61     5     9
#> 10  42.1     194   8.6    69     5    10
#> # … with 143 more rows

You may look back on your work and see room for improvement, but it’s all good! The whole point of drake is to help you go back and change things quickly and painlessly. For example, we forgot to give our histogram a bin width.

readd(hist)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

So let’s fix the plotting function.

create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), binwidth = 10) +
    theme_gray(24)
}

drake knows which results are affected.

vis_drake_graph(plan) # See also r_vis_drake_graph().


The next make() just builds hist and report.html. No point in wasting time on the data or model.

make(plan) # See also r_make().
#> ▶ target hist
#> ▶ target report
loadd(hist)
hist

Reproducibility with confidence

The R community emphasizes reproducibility. Traditional themes include scientific replicability, literate programming with knitr, and version control with git. But internal consistency is important too. Reproducibility carries the promise that your output matches the code and data you say you used. With the exception of non-default triggers and hasty mode, drake strives to keep this promise.

Evidence

Suppose you are reviewing someone else’s data analysis project for reproducibility. You scrutinize it carefully, checking that the datasets are available and the documentation is thorough. But could you re-create the results without the help of the original author? With drake, it is quick and easy to find out.

make(plan) # See also r_make().
#> ℹ unloading 1 targets from environment
#> ✓ All targets are already up to date.

outdated(plan) # See also r_outdated().
#> character(0)

With everything already up to date, you have tangible evidence of reproducibility. Even though you did not re-create the results, you know the results are re-creatable. They faithfully show what the code is producing. Given the right package environment and system configuration, you have everything you need to reproduce all the output by yourself.

Ease

When it comes time to actually rerun the entire project, you have much more confidence. Starting over from scratch is trivially easy.

clean()    # Remove the original author's results.
make(plan) # Independently re-create the results from the code and input data.
#> ▶ target raw_data
#> ▶ target data
#> ▶ target fit
#> ▶ target hist
#> ▶ target report

Big data efficiency

Select specialized data formats to increase speed and reduce memory consumption. In version 7.5.2.9000 and above, the available formats are “fst” for data frames (example below) and “keras” for Keras models (example here).

library(drake)
n <- 1e8 # Each target is 1.6 GB in memory.
plan <- drake_plan(
  data_fst = target(
    data.frame(x = runif(n), y = runif(n)),
    format = "fst"
  ),
  data_old = data.frame(x = runif(n), y = runif(n))
)
make(plan)
#> target data_fst
#> target data_old
build_times(type = "build")
#> # A tibble: 2 x 4
#>   target   elapsed              user                 system    
#>   <chr>    <Duration>           <Duration>           <Duration>
#> 1 data_fst 13.93s               37.562s              7.954s    
#> 2 data_old 184s (~3.07 minutes) 177s (~2.95 minutes) 4.157s

History and provenance

As of version 7.5.2, drake tracks the history and provenance of your targets: what you built, when you built it, how you built it, the arguments you used in your function calls, and how to get the data back. (Disable with make(history = FALSE).)

history <- drake_history(analyze = TRUE)
history
#> # A tibble: 12 x 11
#>    target current built exists hash  command   seed runtime na.rm quiet
#>    <chr>  <lgl>   <chr> <lgl>  <chr> <chr>    <int>   <dbl> <lgl> <lgl>
#>  1 data   TRUE    2020… TRUE   11e2… "raw_d… 1.29e9 0.011   TRUE  NA   
#>  2 data   TRUE    2020… TRUE   11e2… "raw_d… 1.29e9 0.00400 TRUE  NA   
#>  3 fit    TRUE    2020… TRUE   3c87… "lm(Oz… 1.11e9 0.006   NA    NA   
#>  4 fit    TRUE    2020… TRUE   3c87… "lm(Oz… 1.11e9 0.002   NA    NA   
#>  5 hist   FALSE   2020… TRUE   88ae… "creat… 2.10e8 0.011   NA    NA   
#>  6 hist   TRUE    2020… TRUE   0304… "creat… 2.10e8 0.003   NA    NA   
#>  7 hist   TRUE    2020… TRUE   0304… "creat… 2.10e8 0.009   NA    NA   
#>  8 raw_d… TRUE    2020… TRUE   855d… "readx… 1.20e9 0.02    NA    NA   
#>  9 raw_d… TRUE    2020… TRUE   855d… "readx… 1.20e9 0.0330  NA    NA   
#> 10 report TRUE    2020… TRUE   5504… "rmark… 1.30e9 1.31    NA    TRUE 
#> 11 report TRUE    2020… TRUE   5504… "rmark… 1.30e9 0.413   NA    TRUE 
#> 12 report TRUE    2020… TRUE   5504… "rmark… 1.30e9 0.475   NA    TRUE 
#> # … with 1 more variable: output_file <chr>

Remarks:

  • The quiet column appears above because one of the drake_plan() commands calls rmarkdown::render() with quiet = TRUE.
  • The hash column identifies all the previous versions of your targets. As long as exists is TRUE, you can recover old data.
  • Advanced: if you use make(cache_log_file = TRUE) and put the cache log file under version control, you can match the hashes from drake_history() with the git commit history of your code.

Let’s use the history to recover the oldest histogram.

hash <- history %>%
  filter(target == "hist") %>%
  pull(hash) %>%
  head(n = 1)
cache <- drake_cache()
cache$get_value(hash)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Independent replication

With even more evidence and confidence, you can invest the time to independently replicate the original code base if necessary. Up until this point, you relied on basic drake functions such as make(), so you may not have needed to peek at any substantive author-defined code in advance. In that case, you can stay usefully ignorant as you reimplement the original author’s methodology. In other words, drake could potentially improve the integrity of independent replication.

Readability and transparency

Ideally, independent observers should be able to read your code and understand it. drake helps in several ways.

  • The drake plan explicitly outlines the steps of the analysis, and vis_drake_graph() visualizes how those steps depend on each other.
  • drake takes care of the parallel scheduling and high-performance computing (HPC) for you. That means the HPC code is no longer tangled up with the code that actually expresses your ideas.
  • You can generate large collections of targets without necessarily changing your code base of imported functions, another nice separation between the concepts and the execution of your workflow.

Scale up and out.

Not every project can complete in a single R session on your laptop. Some projects need more speed or computing power. Some require a few local processor cores, and some need large high-performance computing systems. But parallel computing is hard. Your tables and figures depend on your analysis results, and your analyses depend on your datasets, so some tasks must finish before others even begin. drake knows what to do. Parallelism is implicit and automatic. See the high-performance computing guide for all the details.

# Use the spare cores on your local machine.
make(plan, jobs = 4)

# Or scale up to a supercomputer.
drake_hpc_template_file("slurm_clustermq.tmpl") # https://slurm.schedmd.com/
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "slurm_clustermq.tmpl"
)
make(plan, parallelism = "clustermq", jobs = 4)

With Docker

drake and Docker are compatible and complementary. Here are some examples that run drake inside a Docker image.

Alternatively, it is possible to run drake outside Docker and use the future package to send targets to a Docker image. drake’s Docker-psock example demonstrates how. Download the code with drake_example("Docker-psock").

Installation

You can choose among different versions of drake. The CRAN release often lags behind the online manual but may have fewer bugs.

# Install the latest stable release from CRAN.
install.packages("drake")

# Alternatively, install the development version from GitHub.
install.packages("devtools")
library(devtools)
install_github("ropensci/drake")

Function reference

The reference section lists all the available functions. Here are the most important ones.

  • drake_plan(): create a workflow data frame (like my_plan).
  • make(): build your project.
  • drake_history(): show what you built, when you built it, and the function arguments you used.
  • r_make(): launch a fresh callr::r() process to build your project. Called from an interactive R session, r_make() is more reproducible than make().
  • loadd(): load one or more built targets into your R session.
  • readd(): read and return a built target.
  • vis_drake_graph(): show an interactive visual network representation of your workflow.
  • recoverable(): list the targets that make(recover = TRUE) can salvage (experimental).
  • outdated(): see which targets will be built in the next make().
  • deps_code(): check the dependencies of a command or function.
  • drake_failed(): list the targets that failed to build in the last make().
  • diagnose(): return the full context of a build, including errors, warnings, and messages.

Documentation

Core concepts

The following resources explain what drake can do and how it works. The workshop at https://github.com/wlandau/learndrake devotes particular attention to drake’s mental model.

In practice

  • Miles McBain’s excellent blog post explains the motivating factors and practical issues {drake} solves for most projects, how to set up a project as quickly and painlessly as possible, and how to overcome common obstacles.
  • Miles’ dflow package generates the file structure for a boilerplate drake project. It is a more thorough alternative to drake::use_drake().
  • drake is heavily function-oriented by design, and Miles’ fnmate package automatically generates boilerplate code and docstrings for functions you mention in drake plans.

Reference

Use cases

The official rOpenSci use cases and associated discussion threads describe applications of drake in the real world. Many of these use cases are linked from the drake tag on the rOpenSci discussion forum.

Here are some additional applications of drake in real-world projects.

drake projects as R packages

Some folks like to structure their drake workflows as R packages. Examples are below. In your own analysis packages, be sure to call drake::expose_imports(yourPackage) so drake can watch your package’s functions for changes and rebuild downstream targets accordingly.

Help and troubleshooting

The following resources document many known issues and challenges.

If you are still having trouble, please submit a new issue with a bug report or feature request, along with a minimal reproducible example where appropriate.

The GitHub issue tracker is mainly intended for bug reports and feature requests. While questions about usage etc. are also highly encouraged, you may alternatively wish to post to Stack Overflow and use the drake-r-package tag.

Contributing

Development is a community effort, and we encourage participation. Please read CONTRIBUTING.md for details.

Similar work

drake enhances reproducibility and high-performance computing, but not in all respects. Literate programming, local library managers, containerization, and strict session managers offer more robust solutions in their respective domains. And for the problems drake does solve, it stands on the shoulders of the giants that came before.

Pipeline tools

GNU Make

The original idea of a time-saving reproducible build system extends back at least as far as GNU Make, which still aids the work of data scientists as well as the original user base of compiled-language programmers. In fact, the name “drake” stands for “Data Frames in R for Make”. Make is used widely in reproducible research. Below are some examples from Karl Broman’s website.

Whereas GNU Make is language-agnostic, drake is fundamentally designed for R.

  • Instead of a Makefile, drake supports an R-friendly domain-specific language for declaring targets.
  • Targets in GNU Make are files, whereas targets in drake are arbitrary variables in memory. (drake does have opt-in support for files via file_out(), file_in(), and knitr_in().) drake caches these objects in its own storage system so R users rarely have to think about output files.

Remake

remake itself is no longer maintained, but its founding design goals and principles live on through drake. In fact, drake is a direct re-imagining of remake with enhanced scalability, reproducibility, high-performance computing, visualization, and documentation.

Factual’s Drake

Factual’s Drake is similar in concept, but the development effort is completely unrelated to the drake R package.

Other pipeline tools

There are countless other successful pipeline toolkits. The drake package distinguishes itself with its R-focused approach, Tidyverse-friendly interface, and a thorough selection of parallel computing technologies and scheduling algorithms.

Memoization

Memoization is the strategic caching of the return values of functions. It is a lightweight approach to the core problem that drake and other pipeline tools are trying to solve. Every time a memoized function is called with a new set of arguments, the return value is saved for future use. Later, whenever the same function is called with the same arguments, the previous return value is salvaged, and the function call is skipped to save time. The memoise package is the primary implementation of memoization in R.

Memoization saves time for small projects, but it arguably does not go far enough for large reproducible pipelines. In reality, the return value of a function depends not only on the function body and the arguments, but also on any nested functions and global variables, the dependencies of those dependencies, and so on upstream. drake tracks this deeper context, while memoise does not.
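The limitation described above can be seen in a minimal base-R sketch. The memoize_sketch() helper is hypothetical and hand-rolled for illustration; it is not how the memoise package is implemented, and it keys the cache on the argument alone.

```r
# Hypothetical, hand-rolled memoizer: caches return values by argument only.
memoize_sketch <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- paste(deparse(x), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

offset <- 1
add_offset <- function(x) x + offset  # depends on the global variable `offset`
fast_add <- memoize_sketch(add_offset)

first <- fast_add(10)    # computes 10 + 1 and caches the result
offset <- 100            # an upstream dependency changes
second <- fast_add(10)   # cache hit: the stale value comes back
fresh <- add_offset(10)  # the uncached function sees the new offset
```

Here second still equals first even though add_offset() now returns a different value. An argument-keyed cache cannot see the change to offset, which is the kind of upstream context the paragraph above says drake tracks.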

Literate programming

Literate programming is the practice of narrating code in plain vernacular. The goal is to communicate the research process clearly, transparently, and reproducibly. Whereas commented code is still mostly code, literate knitr / R Markdown reports can become websites, presentation slides, lecture notes, serious scientific manuscripts, and even books.

knitr and R Markdown

drake and knitr are symbiotic. drake’s job is to manage large computation and orchestrate the demanding tasks of a complex data analysis pipeline. knitr’s job is to communicate those expensive results after drake computes them. knitr / R Markdown reports are small pieces of an overarching drake pipeline. They should focus on communication, and they should do as little computation as possible.

To insert a knitr report in a drake pipeline, use the knitr_in() function inside your drake plan, and use loadd() and readd() to refer to targets in the report itself. See an example here.

Version control

drake is not a version control tool. However, it is fully compatible with git, svn, and similar software. In fact, it is good practice to use git alongside drake for reproducible workflows.

However, data poses a challenge. The datasets created by make() can get large and numerous, and it is not recommended to put the .drake/ cache or the .drake_history/ logs under version control. Instead, it is recommended to use a data storage solution such as Dropbox or OSF.

Containerization and R package environments

drake does not track R packages or system dependencies for changes. Instead, it defers to tools like Docker, Singularity, renv, and packrat, which create self-contained portable environments to reproducibly isolate and ship data analysis projects. drake is fully compatible with these tools.

workflowr

The workflowr package is a project manager that focuses on literate programming, sharing over the web, file organization, and version control. Its brand of reproducibility is all about transparency, communication, and discoverability. For an example of workflowr and drake working together, see this machine learning project by Patrick Schratz.

Citation

citation("drake")
#> 
#> To cite drake in publications use:
#> 
#>   William Michael Landau, (2018). The drake R package: a pipeline
#>   toolkit for reproducibility and high-performance computing. Journal
#>   of Open Source Software, 3(21), 550,
#>   https://doi.org/10.21105/joss.00550
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {The drake R package: a pipeline toolkit for reproducibility and high-performance computing},
#>     author = {William Michael Landau},
#>     journal = {Journal of Open Source Software},
#>     year = {2018},
#>     volume = {3},
#>     number = {21},
#>     url = {https://doi.org/10.21105/joss.00550},
#>   }

Acknowledgements

Special thanks to Jarad Niemi, my advisor from graduate school, for first introducing me to the idea of Makefiles for research. He originally set me down the path that led to drake.

Many thanks to Julia Lowndes, Ben Marwick, and Peter Slaughter for reviewing drake for rOpenSci, and to Maëlle Salmon for such active involvement as the editor. Thanks also to the following people for contributing early in development.

Credit for images is attributed here.


drake's People

Contributors

billdenney, bmchorse, boshek, bpbond, brendanf, chrismuir, crerecombinase, gadenbuie, kendonb, krlmlr, maelle, malcolmbarrett, matthiasgomolka, maurolepore, milesmcbain, noamross, norival, pat-s, rkrug, shrektan, smingerson, strazto, thebioengineer, tiernanmartin, tjmahr, uribo, vkehayas, wlandau, wlandau-lilly, xiaodaigh


drake's Issues

Basic example should be platform independent

In the upcoming version 2.1.0, I hid the cluster-specific code in an if(FALSE) block. The next update already has a platform-independent version of inst/examples/basic/basic.R. The code is ready, but my company needs to screen it before I release it. Stay tuned, everyone.

parallel computing with parallel::mclapply()

Already implemented in version 2.0.0. It was super easy because the new version is totally oriented around igraph. I can't disclose any of that yet because my company's disclosure practices are super slow, but development is going incredibly well.

Hybrid parallel computing options

In make(), the parallelism argument can be "mclapply" (current default on non-Windows systems), "parLapply" (current default on Windows), or "Makefile". The first two parallelize within a single R session across multiple cores, and the third parallelizes over multiple R sessions and potentially multiple nodes on a cluster. What if we want both kinds of parallelism? The best high-performance computing solutions often have parallelism both among nodes and among cores within nodes. I wonder how to tell drake to partition the targets, use "Makefile" to parallelize among groups, and use one of the other options to parallelize within groups. Maybe parallel_stage() can handle it somehow.

Help users protect the workspace (slick function to create a custom evaluation environment for make()).

In the upcoming version 3.0.0, drake's execution environment is the user's workspace by default. As an upshot, the workspace is vulnerable to side effects of make(). To protect your workspace, you may want to create a custom evaluation environment containing all your imported objects and then pass it to the envir argument of make(). Here is how.

library(drake)
envir = new.env(parent = globalenv())
eval(expression({
  f = function(x){
    g(x) + 1
  }
  g = function(x){
    x + 1
  }
}), envir = envir)
myplan = plan(out = f(1:3))
make(myplan, envir = envir)
ls() # Check that your workspace did not change.
ls(envir) # Check your evaluation environment.

I am wondering if I should implement a slick shorthand for eval(expression(...), envir = envir).
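One candidate for such a shorthand, offered only as a sketch: base R's local() evaluates a block of code in a given environment, so it can play the role of eval(expression(...), envir = envir). The example below uses no drake functions. The resulting envir could then be passed to make() exactly as in the code above.

```r
# Build a custom evaluation environment with base R's local().
envir <- new.env(parent = globalenv())
local({
  g <- function(x) {
    x + 1
  }
  f <- function(x) {
    g(x) + 1
  }
}, envir = envir)

defined <- ls(envir)    # f and g live in `envir`, not in the workspace
result <- envir$f(1:3)  # f() finds g() through its enclosing environment
```

Because the functions are created inside envir, their enclosing environment is envir itself, so f() can find g() without either function being copied into the workspace.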

Refactor environments and scoping

#35 is an example of unexpected behavior in an edge case due to drake's policy on environments and scoping. To make behavior more predictable in the next major version (3.0.0), drake will stop micromanaging environments. Targets will be evaluated in and assigned to the user's actual environment, not an isolated deep copy, and the enclosing environments of functions will no longer be reassigned. By default, your workspace will be vulnerable to side effects of make(), but you can always use the envir argument of make() to impose protection.

Change how the names of timestamp files are generated.

Base 64 encoding uses both uppercase and lowercase letters, and I did not realize until now that this behavior could cause collisions on case-insensitive file systems such as those typical on Windows. The risk is low, but it is enough to switch from base64url::base64_urlencode() to base64url::base32_encode() to compute the names of timestamp files. This only affects how Makefiles are written and executed, and a new Makefile is generated on every call to drake::make(..., parallelism = "Makefile"), so this fix will not affect back compatibility. Existing projects will be unharmed.
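A quick way to see the collision risk is to count how many distinct symbols each alphabet keeps after case folding. This sketch assumes the standard RFC 4648 alphabets (URL-safe base 64 and base 32) and does not call the base64url package itself.

```r
# Alphabets per RFC 4648 (assumed here; written out by hand).
base64url_alphabet <- c(LETTERS, letters, 0:9, "-", "_")
base32_alphabet <- c(LETTERS, 2:7)

# A case-insensitive file system effectively folds names to a single case.
folded_base64 <- length(unique(tolower(base64url_alphabet)))
folded_base32 <- length(unique(tolower(base32_alphabet)))
```

After folding, the base 64 alphabet shrinks from 64 symbols to 38, so two different encoded names can map to the same file. The base 32 alphabet keeps all 32 of its symbols distinct.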

Rehash files when and only when necessary. Use timestamps to decide.

I use fingerprints (hashes) to detect when the user's external files change. Fingerprints are expensive, so I use file modification times to judge whether fingerprinting is even worth the expense. The trouble is that file.mtime() is egregiously imprecise on Windows and Mac, so true updates to files could potentially be missed. My current workaround is to force a rehash when

  1. the file is less than 100 Kb, or
  2. the file was just built by drake (not imported)

This rule mostly covers it, but manual changes to any medium-to-large file may be ignored if drake looks at that file in the same second. I would really like to just get more precise timestamps. With even millisecond precision, I could just wait until the next increment in mtime before importing or building the next file.
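The rule above can be sketched in base R as follows. The function should_rehash() and its arguments are hypothetical names chosen for illustration, not drake internals, and the 100 Kb cutoff is approximated as 1e5 bytes.

```r
# Hypothetical helper: is it worth recomputing a file's fingerprint?
should_rehash <- function(path, cached_mtime, just_built = FALSE,
                          size_cutoff = 1e5) {
  is_small <- file.size(path) < size_cutoff           # rule 1: hashing is cheap
  looks_modified <- file.mtime(path) > cached_mtime   # timestamp moved forward
  is_small || just_built || looks_modified            # rule 2: just_built
}

big_file <- tempfile()
writeBin(raw(2e5), big_file)     # 200,000 bytes, above the cutoff
stamp <- file.mtime(big_file)    # pretend this was recorded at the last hash

unchanged <- should_rehash(big_file, cached_mtime = stamp)
rebuilt <- should_rehash(big_file, cached_mtime = stamp, just_built = TRUE)
```

The weak point is the looks_modified comparison. If the file is edited within the same timestamp increment as the recorded mtime, the comparison stays FALSE and the change is missed, which is why more precise timestamps would help.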

Sometime in the future, I may be able to assume all file systems support high-resolution times. Apparently, R 3.3.3 will have this solved for Windows. But until that happens on all platforms, I do not think I can solve this issue.

Possible race condition (unpredictable results based on parallel computing)

The problem

A race condition happens when the final result of a program depends on the unpredictable execution order of parallel tasks. In the case of drake, commands that share intermediate variables could interfere with each other. Here is an example.

library(drake)
myplan = plan(list = c(a = "i <- 1; i", b = "i <- 2; i"))

Both targets in myplan assign the intermediate variable i to the same environment.

> myplan
  target   command
1      a i <- 1; i
2      b i <- 2; i

In theory, make(myplan, jobs = 2) could incorrectly assign 2 to a or 1 to b.

Workarounds

The easiest workaround is just to make each command a function call or a set of nested function calls so that local variables are protected. If this does not suit your project, I recommend enclosing each command in a call to an anonymous function.

functionize = function(command){
  paste0("(function(){\n", command, "\n})()")
}
myplan$command = functionize(myplan$command)

That way, make(myplan, jobs = 2) is safe because the variable i is protected from drake's execution environment in a local scope.

> myplan
  target                       command
1      a (function(){\ni <- 1; i\n})()
2      b (function(){\ni <- 2; i\n})()
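The effect of the wrapper can be checked without drake by evaluating a command string directly in a scratch environment. This is only a base-R sketch of the scoping behavior; it does not reproduce how make() evaluates commands.

```r
functionize <- function(command) {
  paste0("(function(){\n", command, "\n})()")
}

# Unwrapped: the intermediate variable i is assigned in the shared environment.
shared <- new.env()
raw_value <- eval(parse(text = "i <- 1; i"), envir = shared)
leaked <- exists("i", envir = shared, inherits = FALSE)

# Wrapped: i only exists inside the anonymous function's local scope.
protected <- new.env()
wrapped_value <- eval(parse(text = functionize("i <- 1; i")), envir = protected)
contained <- !exists("i", envir = protected, inherits = FALSE)
```

Both versions return the same value, but only the unwrapped command leaves i behind in the environment where a second command could overwrite it.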

Permanent solution

I have already inserted functionize() inside build() in the closed-source version, so this issue will be resolved in the next release and disclosure. Stay tuned for version 3.0.0, which will arrive as soon as I can disclose it.

In status(), deprecate imported_files_only in favor of no_imported_objects

Currently, status(..., imported_files_only = TRUE) lists the statuses of imported files and targets with commands. Here, the name imported_files_only is confusing and misleading. A clearer replacement would be status(..., no_imported_objects = TRUE), which would also align with cached(..., no_imported_objects = TRUE).

Local variables in functions are sometimes confused with code dependencies.

f = function(){
  y = 1
  return(y)
}

my_plan = plan(out = f())
make(my_plan) # correctly builds `out`
make(my_plan) # correctly skips `out`
y = 2 # gets confused as a dependency of f
make(my_plan) # incorrectly builds `out`. Not a disaster, but totally unnecessary.
readd(out) # correctly returns 1

I know I should be using codetools::findGlobals() rather than pryr::call_tree(). That way, local variables inside functions aren't confused with dependencies. I already solved this in the development branch, but the new code needs to be reviewed, and it will take some time to disclose it from my company.
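For illustration, here is how codetools::findGlobals() treats a local variable. The functions f() and g() are toy examples, and the sketch assumes the codetools package is installed (it ships with standard R distributions as a recommended package).

```r
g <- function(x) {
  x + 1
}
f <- function() {
  y <- 1  # local variable: should not count as a dependency
  g(y)    # global function: a genuine dependency
}

globals <- codetools::findGlobals(f, merge = FALSE)
depends_on_g <- "g" %in% globals$functions
depends_on_y <- "y" %in% unlist(globals)
```

Since y is assigned inside f(), it is not reported as a global, so a change to an unrelated y in the workspace would give no reason to rebuild anything that calls f().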

Users need to call library(drake) before drake::make()

Users need to call library(drake) before using any of drake's main functions. Because of a careless error on my part, drake::make() does not currently work if drake is not already loaded.

Here is the problem. I use drake functions to set default arguments to other drake functions. For example, in config(), I use the parallelism_choices() function to sanitize input to the parallelism argument. In the case of config(), I have something like

config = function(parallelism = parallelism_choices() .......... 

instead of

config = function(parallelism = drake::parallelism_choices() .......... 

I fixed and tested all this in the closed-source repo, and I will disclose as soon as I can.

Workflow plan data frames must NOT HAVE FACTORS!

Current behavior:

> library(drake)
> myplan = data.frame(target = "a", command = "sqrt(5)")
> str(myplan)
'data.frame':   1 obs. of  2 variables:
 $ target : Factor w/ 1 level "a": 1
 $ command: Factor w/ 1 level "sqrt(5)": 1
> make(myplan)
import sqrt
build a
> readd(a) # INCORRECT
[1] 1
> myplan = data.frame(target = "a", command = "sqrt(5)", stringsAsFactors = FALSE)
> str(myplan)
'data.frame':   1 obs. of  2 variables:
 $ target : chr "a"
 $ command: chr "sqrt(5)"
> make(myplan)
import sqrt
build a
> readd(a) # CORRECT
[1] 2.236068

Fixed in the upcoming patch. Stay tuned!
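Until the patch is available, one possible workaround is to convert factor columns to character before calling make(). The sketch below uses only base R. The helper name sanitize_plan() is hypothetical and not part of drake. Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so factors are requested explicitly here to reproduce the problem.

```r
# Hypothetical helper: convert every factor column of a plan to character.
sanitize_plan <- function(plan) {
  plan[] <- lapply(plan, function(col) {
    if (is.factor(col)) as.character(col) else col
  })
  plan
}

myplan <- data.frame(
  target = "a",
  command = "sqrt(5)",
  stringsAsFactors = TRUE # reproduce the factor columns from the transcript
)
myplan <- sanitize_plan(myplan)
str(myplan) # both columns are now character vectors
```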

When an imported function is cached, its environment is lost.

Related to #35. This issue becomes relevant when the user calls readd() or loadd().

> library(drake)
> f = Vectorize(function(x){x + 1}, "x")
> environment(f) # CORRECT
<environment: 0x3f09fa8>
> ls(environment(f)) # CORRECT
[1] "arg.names"      "collisions"     "FUN"            "FUNV"
[5] "SIMPLIFY"       "USE.NAMES"      "vectorize.args"
> myplan = plan(x = f(1:10))
> make(myplan, verbose = FALSE)
Error in match(x, table, nomatch = 0L) :
  object 'vectorize.args' not found
> loadd(f)
> environment(f) # NOT THE DESIRED RESULT
<environment: R_GlobalEnv>
> ls(environment(f)) # NOT THE DESIRED RESULT
 [1] "f"  "myplan"

Solution: cache the function itself in a separate storr namespace called "functions". (Currently, only the deparsed text of the function is cached.) Then, recover the un-deparsed function in readd(). Since loadd() calls readd(), this should only need to be fixed in one place.
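The difference between caching deparsed text and caching the function object can be sketched in base R. This is only an illustration under the assumption that the cache stores a serialized copy of the object. It does not show drake's or storr's internals.

```r
f <- Vectorize(function(x) x + 1, "x")

# Round trip 1: deparse to text, then parse and evaluate in a fresh environment.
# The closure variables created by Vectorize() are not part of the text.
fresh <- new.env(parent = baseenv())
f_text <- eval(parse(text = deparse(f)), envir = fresh)
text_result <- tryCatch(f_text(1:3), error = function(e) conditionMessage(e))
print(text_result) # an error message about a missing closure variable

# Round trip 2: serialize the function object itself.
# A non-global enclosing environment is serialized along with the function.
f_bin <- unserialize(serialize(f, connection = NULL))
print(ls(environment(f_bin))) # closure variables such as vectorize.args survive
print(f_bin(1:3))
```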

Imported functions created with Vectorize() do not work.

Current behavior

> library(drake)
> f = Vectorize(function(x){
+ x + 1
+ }, "x")
> myplan = plan(y = f(1:10))
> make(myplan)
import c
import as.list
import character
import do.call
import eval
import FUN
could not find FUN
import is.null
import lapply
import length
import list
import match.call
import parent.frame
import SIMPLIFY
could not find SIMPLIFY
import USE.NAMES
could not find USE.NAMES
import vectorize.args
could not find vectorize.args
import f
build y
Error in match(x, table, nomatch = 0L) :
  object 'vectorize.args' not found

Easy workaround

Enclose your vectorized function in a wrapper function.

> library(drake)
> h = function(z){
+   f = Vectorize(function(x){
+    x + 1
+   }, "x")
+   f(z)
+ }
> myplan = plan(y = h(1:10))
> make(myplan)
import Vectorize
import h
build y
> readd(y)
 [1]  2  3  4  5  6  7  8  9 10 11
>

Root cause

  • R uses lexical scoping, which affects how non-local variables of functions are found. Non-local variables are found in the environment where the function was originally defined (parent.env()), not the environment where the function is called (parent.frame()).
  • f() was defined with Vectorize(), so it has non-local variables SIMPLIFY, USE.NAMES, etc. in its closure. You can verify this with ls(environment(f)). See Hadley's intro to function environments.
  • In make(), drake creates its own special environment from scratch and assigns all imported functions to that environment. (Otherwise, calls to nested imported functions fail.) Thus, all symbols like SIMPLIFY and USE.NAMES are looked up in an environment different than the original environment(f).
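The scoping rule in the bullets above can be seen in a small base R sketch. The functions make_adder() and caller() are hypothetical names used only for illustration.

```r
make_adder <- function(n) {
  function(x) x + n # `n` is a non-local variable stored in the closure
}
add2 <- make_adder(2)

caller <- function() {
  n <- 100 # lives in the calling frame, so add2() ignores it
  add2(1)
}
print(caller()) # uses n = 2 from the defining environment

# Moving the function to an unrelated environment breaks the lookup of `n`.
moved <- add2
environment(moved) <- new.env(parent = baseenv())
moved_result <- tryCatch(moved(1), error = function(e) conditionMessage(e))
print(moved_result) # an error message, because `n` can no longer be found
```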

Solution

Sathish solved this specific scenario on StackOverflow. However, this is just one of a handful of edge cases due to the root cause above. To make behavior more predictable, drake needs to stop micromanaging environments and functions altogether and just work in the user's environment. See #37.

make(..., parallelism = "Makefile") fails to write file targets in project subdirectories

I'm really kicking myself for this one. If you have a file target named something like 'my_folder/my_file.csv', make(..., parallelism = "Makefile") quits in error. The dummy timestamp file is .drake/ts/'my_folder/my_file.csv', which cannot be created outright using file.create(). Even without the single quotes, .drake/ts/my_folder/my_file.csv would be in the my_folder subdirectory, which would need to be created beforehand.

But don't worry, this was an easy fix. I just borrowed from base64url::base64_urlencode() to turn target names like 'my_folder/my_file.csv' into machine-friendly strings. The closed-source version is already patched, and I will upload the fix as soon as I can.
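A rough sketch of the idea follows. The helper encode_target() is hypothetical. It uses base64url::base64_urlencode() when the base64url package is installed and otherwise falls back to percent-encoding with utils::URLencode(), which is a stand-in and not what the patch uses.

```r
# Hypothetical helper: turn a target name into a string that is safe as a file name.
encode_target <- function(target) {
  if (requireNamespace("base64url", quietly = TRUE)) {
    base64url::base64_urlencode(target)
  } else {
    utils::URLencode(target, reserved = TRUE) # fallback, not the patched behavior
  }
}

target <- "'my_folder/my_file.csv'"
encoded <- encode_target(target)
print(encoded)

# The encoded name contains no path separators or quotes,
# so the timestamp file can be created without any subdirectory.
ts_dir <- file.path(tempdir(), "ts")
dir.create(ts_dir, showWarnings = FALSE)
created <- file.create(file.path(ts_dir, encoded))
print(created)
```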

Directories (folders) are not reproducibly tracked.

Yes, you can declare a file target or input file by enclosing it in single quotes in your workflow plan data frame. But entire directories (i.e. folders) cannot yet be tracked this way. This is a trickier problem to solve, and lots of individual edge cases need to be ironed out before I can deliver a clean, reliable implementation.

Get dependencies of knitr reports automatically.

It should be possible to

  1. Recognize a knitr report by the .Rmd or .Rnw file extension.
  2. Extract all the code chunks, including evaluated inline code.
  3. Get the objects read into the report by readd() or loadd()

But this would miss external files. On the other hand, scanning for any mention of any target in any code chunk might be too aggressive.
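The three steps could be sketched with base R regular expressions as below. This is a rough illustration, not a proposed implementation: the report contents and target names are made up, only Rmd-style fenced chunks are handled (Rnw files use different chunk delimiters), and named arguments such as list = are not treated specially.

```r
fence <- strrep("`", 3) # built programmatically to keep this example easy to embed

# Step 1: recognize a knitr report by its file extension (hypothetical file name).
report <- "report.Rmd"
is_knitr <- grepl("\\.(Rmd|Rnw)$", report, ignore.case = TRUE)

# Hypothetical report contents; target names are made up for illustration.
rmd <- c(
  "# Report",
  paste0(fence, "{r}"),
  "loadd(fit, summary_table)",
  "x <- readd(raw_data)",
  fence,
  "There are `r nrow(readd(raw_data))` rows."
)

# Step 2: separate chunk code from prose, then pull inline code out of the prose.
is_fence <- startsWith(rmd, fence)
in_chunk <- cumsum(is_fence) %% 2 == 1 & !is_fence
prose <- rmd[!in_chunk & !is_fence]
inline <- unlist(regmatches(prose, gregexpr("`r [^`]+`", prose)))
code <- c(rmd[in_chunk], inline)

# Step 3: collect the arguments of readd() and loadd() calls.
calls <- unlist(regmatches(code, gregexpr("(readd|loadd)\\([^()]*\\)", code)))
args <- sub("\\)$", "", sub("^(readd|loadd)\\(", "", calls))
targets <- unique(trimws(unlist(strsplit(args, ","))))
print(targets)
```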

Packages automatically detected or specified with make(..., packages = ...) must actually be installed.

Otherwise, make() could quit in error. This causes problems if you call devtools::load_all("your_package") before make() because then drake detects and tries to load "your_package". In the next patch I am preparing (unavoidably closed-source), I already fixed this problem by using require() rather than library() in add_packages_to_prework() so that a warning is generated rather than an error. But for now, I do not think this little hiccup is worth rushing another public release. If you want to call devtools::load_all("your_package"), put that line of code in the prework argument and manually set packages so that it excludes "your_package".

devtools::load_all("your_package") # Try to avoid load_all() in your regular setup code...
packages = "MASS" # ...otherwise, declare all your packages so drake does not search for them.
prework = 'devtools::load_all("your_package")' # It is best to use the prework for load_all().
make(my_plan, packages = packages, prework = prework)
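The difference between require() and library() that motivates the fix can be checked in base R. The package name below is hypothetical and assumed not to be installed.

```r
pkg <- "notARealPackage12345" # hypothetical name, assumed not to be installed

# require() warns and returns FALSE when the package is unavailable.
loaded <- suppressWarnings(require(pkg, character.only = TRUE))
print(loaded)

# library() signals an error instead, which would stop make().
outcome <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
print(outcome)
```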

Drake overlooks dependencies in some edge cases.

Next time I disclose drake from my company, the information below will be in the "caution" vignette.

Drake uses codetools::findGlobals() in the backend to look for dependencies, which can be fooled. For example, suppose you have a custom function f in your workspace.

f <- function(){
  b = get("x", envir = globalenv())
  digest::digest(readLines('my_file.txt'))
}

When drake looks for the dependencies of f, it will fail to recognize the object x, the function digest(), and the file 'my_file.txt'. Object x is referenced with a quoted string, not a symbol, which tricks drake. The function digest() is referenced with the scoping operator ::, so codetools::findGlobals() does not detect it. Lastly, because 'my_file.txt' is inside a function and not a command in your workflow plan data frame, drake will not reproducibly track it.
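To see these blind spots directly, one can run codetools::findGlobals() on f. The sketch below only analyzes the function and never calls it, so neither the digest package nor 'my_file.txt' needs to exist. The exact set of names returned may vary with the codetools version.

```r
library(codetools)

f <- function() {
  b <- get("x", envir = globalenv())
  digest::digest(readLines("my_file.txt"))
}

# findGlobals() inspects symbols in the code; strings are not treated as names.
globals <- findGlobals(f)
print(globals)

# None of the three overlooked dependencies appear in the result.
print(c("x", "digest", "my_file.txt") %in% globals)
```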

When it comes to commands in your workflow plan data frame, there are similar issues. It is possible to use double-quoted strings and the scoping operator :: to trick drake into overlooking objects, functions, and files that should be dependencies. Use the check() function to scan the workflow plan for double-quoted strings and print out messages telling you where they occur.

If you are ever unsure about which targets and dependencies in your project are reproducibly tracked, please look at the dependency tree/graph of your workflow plan. Use build_graph() to obtain an igraph object of the dependency structure of your workflow, and use plot_graph() to make a plot of the graph.

Parallelize the imports when `parallelism` equals "Makefile"

In make(..., parallelism = "Makefile"), only the targets in your workflow plan data frame are parallelized. All the imports are computed serially, which should never have to be the case. You may have large input files to hash or a lot of web data to scrape and import, and this could take long enough that parallelism is useful. In v3.1.0 (coming soon), the imports are parallelized with either parLapply() or mclapply() (drake selects the best option for your system). As an added bonus, you will optionally have different levels of parallelism for targets vs imports.

make(..., 
  jobs = 4, # for imports
  args = "--jobs=8" # for targets in your workflow plan data frame
)

There will also be an imports_only flag in make() if you want to just import objects/files and not build any targets after that.
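As a rough sketch of how import work might be spread across jobs with the parallel package, the example below hashes a few temporary files. Choosing the backend by operating system is an assumption made for illustration, since the rule drake uses to pick between parLapply() and mclapply() is not described here. The helper hash_one() and the use of tools::md5sum() are also stand-ins.

```r
library(parallel)

# Hypothetical stand-in for import work: hash a few temporary files.
files <- replicate(4, tempfile(fileext = ".txt"))
for (i in seq_along(files)) writeLines(paste("input", i), files[i])
hash_one <- function(path) unname(tools::md5sum(path))

jobs <- 2
if (.Platform$OS.type == "windows") {
  # Forking is unavailable on Windows, so use a socket cluster.
  cl <- makeCluster(jobs)
  hashes <- parLapply(cl, files, hash_one)
  stopCluster(cl)
} else {
  hashes <- mclapply(files, hash_one, mc.cores = jobs)
}
print(unlist(hashes))
```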

Clean up code in Make-class.R

The code in general gets messy because I was learning and coding at the same time. I'm willing to do a massive overhaul, but it will take some time.

Beware leading and trailing whitespace in target names... for now.

Currently, the following code makes targets named " a " and "b"

plan = data.frame(target = c("       a     ", "b"), command = 1:2)
make(plan)

Since nobody likes that, I have already tweaked the closed-source version to use stringr::str_trim(..., side = "both") so that target "a" is made instead of " a ". Will disclose with the next public release.
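Until that release, target names can be trimmed by hand before calling make(). The sketch below uses base R's trimws() as a stand-in for stringr::str_trim(..., side = "both"). The data frame is illustrative.

```r
plan <- data.frame(
  target = c("       a     ", "b"),
  command = c("1", "2"),
  stringsAsFactors = FALSE
)

# Strip leading and trailing whitespace from every target name.
plan$target <- trimws(plan$target, which = "both")
print(plan$target)
```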

Long commands supplied to plan() through `...` may not be parsed correctly.

Long commands supplied to ... in plan() are mangled.

> library(drake)
> workflow = plan(x = do_simulations(arguments),
+   results = my_hypothesis_tests(data_object, argument_two = "abc",
+     argument_three = "xyz", argument_four = flag))
> workflow
   target
1
2 results
                                                                                                                    command
1 x                                                                                               do_simulations(arguments)
2 c('my_hypothesis_tests(data_object, argument_two = \\'abc\\', argument_three = \\'xyz\\', ', '    argument_four = flag)')

To avoid this problem for now, use the list argument to plan() rather than ... for long commands. Then, make(workflow) will work as usual on the corrected workflow plan data frame.

> workflow = plan(x = do_simulations(arguments),
+   list = c(results = 'my_hypothesis_tests(data_object, argument_two = "abc", argument_three = "xyz", argument_four = flag)'),
+   strings_in_dots = "literals")
> workflow
   target
1       x
2 results
                                                                                               command
1                                                                            do_simulations(arguments)
2 my_hypothesis_tests(data_object, argument_two = "abc", argument_three = "xyz", argument_four = flag)

In the closed-source repo, I paste together the lines of the output of deparse(), so this issue will be solved in the next disclosure and patch.
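The underlying behavior is easy to reproduce in base R: deparse() splits long expressions into several strings, and collapsing them yields a single command again. This is a minimal sketch of the idea, not the patched drake code.

```r
cmd <- quote(
  my_hypothesis_tests(data_object, argument_two = "abc",
    argument_three = "xyz", argument_four = flag)
)

# With the default width.cutoff, a long call is returned as several strings.
lines <- deparse(cmd)
print(length(lines))

# Collapse the pieces into one command string.
command <- paste(trimws(lines), collapse = " ")
print(command)

# The collapsed string parses back to the original call.
print(identical(parse(text = command)[[1]], cmd))
```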

Remember `prepend = "SHELL=./shell.sh"` in the quickstart vignette

In order to connect drake to a job scheduler like the Univa Grid Engine, you need to set the prepend argument to tell the generated Makefile how to talk to the cluster.

make(some_plan, parallelism = "Makefile", jobs = 4, # jobs can be whatever 
  prepend = "SHELL=./shell.sh") # see the quickstart vignette for shell.sh

SLURM users can just point to srun and dispense with shell.sh altogether.

make(some_plan, parallelism = "Makefile", jobs = 4,
  prepend = "SHELL=srun")

This piece is currently missing from the high-performance computing section of quickstart.Rmd. Without it, the computations will run exclusively on the head node or login node. Fortunately, the basic example has this important piece, but it should still mention SLURM.

drake does not import functions referenced with `::`

> library(drake)
> plan = plan(x = f(1))
> f = function(x) digest::digest(x)
> make(plan)
import f
build x
> f = function(x) digest(x)
> library(digest)
> make(plan)
import digest
import f
build x

I guess codetools::findGlobals() does not recognize scoped functions, which makes total sense. Since I am totally reliant on findGlobals(), I am not sure I can solve this issue cleanly. Maybe I could specifically look for :: and ::: and treat them differently, but dealing with those special cases could have unintended consequences. I'll have to think about it more before I decide on a solution.
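One way to look for :: and ::: specifically is sketched below with base R's all.names(). The helper find_namespaced_calls() is hypothetical, and it assumes both sides of the operator are written as bare names (not strings). Whether drake should then import or ignore such functions is left open.

```r
# Hypothetical helper: list `pkg::fun` and `pkg:::fun` references in a function body.
find_namespaced_calls <- function(fun) {
  nm <- all.names(body(fun))
  idx <- which(nm %in% c("::", ":::"))
  # all.names() walks the code depth-first, so the package and function
  # names follow the operator directly.
  unique(paste0(nm[idx + 1], nm[idx], nm[idx + 2]))
}

f <- function(x) digest::digest(x)
g <- function(x) digest(x)

print(find_namespaced_calls(f)) # the namespaced reference is found
print(find_namespaced_calls(g)) # nothing to report for an unqualified call
```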

External packages as dependencies

Sometimes I write custom one-off packages and develop them alongside the drake workflows I am working on. So maybe the code analysis should walk deeper into functions from packages. The current behavior is to walk all the way through the functions in the environment (to discover and track any functions nested in user-defined functions) but stop at functions from packages (including base).

Universal build rule for Makefiles

In a parallelRemake pull request, @krlmlr suggested a universal build rule that could clean up and shorten Makefiles. For drake, it might look something like this:

.drake/ts/%:
        Rscript -e 'drake::mk("$<")'

...except that file targets in drake are single-quoted, so they need

.drake/ts/%:
        Rscript -e 'drake::mk(drake::as_file("$<"))'

because the literal single quotes that denote file names would conflict with the single quotes required for Rscript. I'm not sure it's possible to define a universal build rule for these two separate cases, so I may not be able to solve this issue.

Play nicer with tibbles

Current behavior:

> plan = tibble::tribble(~target, ~command, "x", 1)
> drake::make(plan)
build x
Warning message:
Unknown or uninitialised column: 'output'.

make() still works, but the internal possible_targets() function generates an annoying unnecessary warning. Fixed in the closed-source development version. Will be released with the next patch.

Play nicer with devtools::load_all()

Packages loaded by the user before make() need to be loaded again for each process in some of the more advanced forms of parallel computing. That's the whole point of the packages and prework arguments.

Maybe drake should try require() first and then devtools::load_all(pkg = path_to_package) if that fails. But then I need to know path_to_package, so it may not be such a trivial fix. But since packages are loaded before any prework is done or any targets are made, this is not likely to exacerbate #13.

Release wlandau-lilly's updates faster

I work for an old-fashioned company in a highly-regulated industry. Every time I want to change the code, I have to go through a disclosure process that usually takes several weeks.

However, I can and will accept pull requests. All known issues and new ideas are fair game.

My colleagues and I are working to change the rules so we can update our own work at a reasonable pace. Until then, many thanks for your patience.
