ds-pipelines-targets-2's Issues

How to get past the gotchas without getting gotten again

In this section, we're going to go one by one through a series of tips that will help you avoid common pitfalls in pipelines. These tips will help you in the next sections and in future work. A quick list of what's to come:

  • 🔍 How to debug in a pipeline
  • 👀 Visualizing and understanding the status of dependencies in a pipeline
  • 💬 tar_visnetwork() and tar_outdated() to further interrogate the status of pipeline targets
  • 🔃 What is a cyclical dependency and how do I avoid it?
  • ⚠️ Undocumented file output from a function
  • 📂 Using a directory as a dependency
  • 📋 How do I know when to use an object vs a file target or even use a target at all?
  • ⚙️ USGS Data Science naming conventions
  • 🔓 Final tips for smart pipelining

⌨️ add a comment to this issue and the bot will respond with the next topic


I'll sit patiently until you comment

USGS Data Science conventions

So far you’ve learned a lot about the mechanics of using targets, but there are also a few conventions that USGS Data Science practitioners use to maintain consistency across projects. These conventions make it easier to jump into a new project, provide peer review, or learn a new technique from someone else’s pipeline, since you are already familiar with the structure.


As you learned in the first pipelines course, we like to separate pipelines into distinct phases based on what is happening to the data (we usually use fetch, process, visualize, etc). So far in this course, we have been using a single list of targets in the _targets.R makefile. This works for short pipelines, but when you have bigger, more complex pipelines, that file and target list could get HUGE and difficult to read.

For this reason, we like to have multiple makefiles that each describe and are named after a single phase, e.g., 1_fetch.R or 4_visualize.R. Within each phase makefile, targets are saved in an R list object which is numbered based on the phase, e.g., p1_targets_list or p4_targets_list. Then, the main _targets.R makefile sources each of these phase makefiles and combines the target lists into a single list using c(), e.g., c(p1_targets_list, p2_targets_list, p3_targets_list, p4_targets_list).

In addition to this multi-makefile approach, we also like to name our targets to make it clear which phase they belong to. For example, any target created in the fetch phase would be prefixed with p1_. We do this for two reasons: 1) it is clearer, and 2) you can now use tidyselect syntax (the same helpers used by dplyr::select) to build all targets in a single phase by running tar_make(starts_with("p1")). A handy little trick!

Consider the two-phased pipeline below, where you need to download data from ScienceBase and then combine it all into a single dataframe.

If the 1_fetch.R makefile looked like this

p1_targets_list <- list(
  tar_target(
    p1_sb_files,
    {
      dummy <- '2021-04-19' # dummy value; change this date to force the target to rebuild and re-download
      item_file_download(sb_id = "4f4e4acae4b07f02db67d22b", 
                         dest_dir = "1_fetch/tmp",
                         overwrite_file = TRUE)
    },
    format = "file",
    packages = "sbtools"
  )
)

and the 2_process.R makefile looked like this

source("2_process/src/combine_files.R")

p2_targets_list <- list(
  tar_target(
    p2_plot_data, 
    combine_into_df(p1_sb_files)
  )
)

then the canonical _targets.R makefile would look like this

library(targets)

source("1_fetch.R")
source("2_process.R")

# Return the complete list of targets
c(p1_targets_list, p2_targets_list)

You could then build the full pipeline by running tar_make(), or run specific phases using tar_make(starts_with("p1")) for the fetch phase and tar_make(starts_with("p2")) for the process phase.
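In practice, that looks like this at the R console (the phase prefixes here match the example above):

# Build every target in the pipeline
tar_make()

# Build only the fetch-phase or process-phase targets
tar_make(starts_with("p1"))
tar_make(starts_with("p2"))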


⌨️ Activity: Split your pipeline targets into the phases fetch, process, and visualize. Use a different makefile for each phase and follow our phase-naming conventions to name the makefiles and list objects. Also, rename your targets using the appropriate prefix (p1_, p2_, p3_). Run tar_make() and open a pull request. Paste your build status as a comment to the PR and assign your designated course instructor as a reviewer.


Once your pull request has been merged, return here and comment on this issue

Exchange object and file targets in your pipelines

You should now have a working pipeline that can run with tar_make(). Your current pipeline likely only has one file target, which is the final plot.

We want you to get used to exchanging objects for files and vice versa, in order to expose some of the important differences that show up in the makefile and also in the way the functions are put together.
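Before you start, here is a minimal sketch of the same step written both ways. The p1_site_data target and summarize_site_data() function are hypothetical, not taken from your repository:

# As an object target: the command returns an R object, which targets stores for you
tar_target(
  p2_site_summary,
  summarize_site_data(p1_site_data)
),

# As a file target: the command must write the file and return its path,
# and the target needs format = "file"
tar_target(
  p2_site_summary_csv,
  {
    out_file <- "2_process/out/site_summary.csv"
    write.csv(summarize_site_data(p1_site_data), out_file, row.names = FALSE)
    out_file
  },
  format = "file"
)

Notice that the swap changes both the target recipe in the makefile and what the underlying function (or expression) needs to return.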

⌨️ Activity: Open a PR where you swap two object targets to be file targets and change one file target to be an object target. Run tar_make() to check that the pipeline still builds. Paste your build status as a comment to the PR and assign your designated course instructor as a reviewer.


I'll sit patiently until your pull request has been merged

Refactor the existing pipeline to use more effective targets

⌨️ Activity: Make modifications to the working, but less-than-ideal, pipeline that exists within your course repository.

Within the course repo you should see only a _targets.R and directories with code or placeholder files for each phase. You should be able to run tar_make() and build the pipeline, although it may take numerous tries, since some parts of this new workflow are brittle. Some hints to get you started: the site_data target is too big, and you should consider splitting it into a target for each site, perhaps using the download_nwis_site_data() function directly to write a file. Several of the site_data_ targets are too small and it might make sense to combine them.
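As a concrete starting point, one split-out site target might look roughly like the sketch below. The argument names passed to download_nwis_site_data() and the output file path are assumptions for illustration only; check the function's definition in your repository before copying this pattern:

tar_target(
  p1_site_data_01427207_csv,
  download_nwis_site_data(
    out_file = "1_fetch/out/nwis_01427207_data.csv",  # hypothetical argument names
    site_num = "01427207"                             # example site number
  ),
  format = "file"
)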


When you are happy with your newer, better workflow, create a pull request with your changes and assign your designated course instructor as a reviewer. Add a comment to your own PR with thoughts on how you approached the task, as well as key decisions you made.

You should create a local branch called "refactor-targets" and push that branch up to the "remote" location (which is the GitHub host of your repository). We're naming this branch "refactor-targets" to represent the concepts in this section of the lab. In the future you'll probably choose branch names according to the type of work they contain - for example, "pull-oxygen-data" or "fix-issue-17".

git checkout -b refactor-targets
git push -u origin refactor-targets

A human will interact with your pull request once you assign them as a reviewer

Strategies for defining targets in data pipelines

How to make decisions on how many targets to use and how targets are defined

We've covered a lot of content about the rules of writing good pipelines, but pipelines are also very flexible! Pipelines can have as many or as few targets as you would like, and targets can be as big or as small as you would like. The key theme for all pipelines is that they are reproducible codebases to document your data analysis process for both humans and machines. In this next section, we will learn about how to make decisions related to the number and types of targets you add to a pipeline.

Background

Isn't it satisfying to work through a fairly lengthy data workflow, return to the project later, and have it just work? For the past few years, we have been capturing the steps that go into creating the results, figures, and tables that appear in data visualizations or research papers. There are recipes for reproducibility used in complex, collaborative modeling projects, such as this reservoir temperature modeling pipeline and this pipeline for managing downloads of forecasted meteorological driver data. Note that you need access to internal USGS websites to see these examples, and that they were developed early in the Data Science team's adoption of targets, so they may not showcase all of our adopted best practices.


Here is a much simpler example that was used to generate Figure 1 from Water quality data for national‐scale aquatic research: The Water Quality Portal (published in 2017):

library(targets)

## All R files that are used must be listed here:
source("R/get_mutate_HUC8s.R")
source("R/get_wqp_data.R")
source("R/plot_huc_panel.R")

tar_option_set(packages = c("dataRetrieval", "dplyr", "httr", "lubridate", "maps",
                            "maptools", "RColorBrewer", "rgeos", "rgdal", "sp", "yaml"))

# Load configuration files
p0_targets_list <- list(
  tar_target(map_config_yml, "configs/mapping.yml", format = "file"),
  tar_target(map_config, yaml.load_file(map_config_yml)),
  tar_target(wqp_config_yml, "configs/wqp_params.yml", format = "file"),
  tar_target(wqp_config, yaml.load_file(wqp_config_yml))
)

# Fetch data
p1_targets_list <- list(
  tar_target(huc_map, get_mutate_HUC8s(map_config)),
  tar_target(phosphorus_lakes, get_wqp_data("phosphorus_lakes", wqp_config, map_config)),
  tar_target(phosphorus_all, get_wqp_data("phosphorus_all", wqp_config, map_config)),
  tar_target(nitrogen_lakes, get_wqp_data("nitrogen_lakes", wqp_config, map_config)),
  tar_target(nitrogen_all, get_wqp_data("nitrogen_all", wqp_config, map_config)),
  tar_target(arsenic_lakes, get_wqp_data("arsenic_lakes", wqp_config, map_config)),
  tar_target(arsenic_all, get_wqp_data("arsenic_all", wqp_config, map_config)),
  tar_target(temperature_lakes, get_wqp_data("temperature_lakes", wqp_config, map_config)),
  tar_target(temperature_all, get_wqp_data("temperature_all", wqp_config, map_config)),
  tar_target(secchi_lakes, get_wqp_data("secchi_lakes", wqp_config, map_config)),
  tar_target(secchi_all, get_wqp_data("secchi_all", wqp_config, map_config))
)

# Summarize the data in a plot
p2_targets_list <- list(
  tar_target(
    multi_panel_constituents_png,
    plot_huc_panel(
      "figures/multi_panel_constituents.png", huc_map, map_config, 
      arsenic_lakes, arsenic_all, nitrogen_lakes, nitrogen_all, 
      phosphorus_lakes, phosphorus_all, secchi_lakes, secchi_all, 
      temperature_lakes, temperature_all
    ),
    format = "file")
)

# Combine all targets into a single list
c(p0_targets_list, p1_targets_list, p2_targets_list)

This makefile recipe generates a multipanel map, which colors HUC8 watersheds according to how many sites within the watershed have data for various water quality constituents:
(Figure: the multi_panel_constituents.png multipanel map)


The "figures/multi_panel_constituents.png" figure takes a while to plot, so it is a somewhat "expensive" target to iterate on when it comes to style, size, colors, and layout (it takes 3 minutes to plot for me). But the plotting expense is dwarfed by the amount of time it takes to build each water quality data "object target", since get_wqp_data uses a web service that queries a large database and returns a result; the process of fetching the data can sometimes take over thirty minutes (nitrogen_all is a target that contains the locations of all of the sites that have nitrogen water quality data samples).

In contrast, the map_config target above builds in a fraction of a second. It contains some simple information that is used to fetch and process the proper boundaries with the get_mutate_HUC8s function, and it includes some plotting details for the final map (such as plotting color divisions).

This example, although dated, represents a real project that caused us to think carefully about how many targets we use in a recipe and how complex their underlying functions are. Decisions related to targets are often motivated by the intent of the pipeline. In the case above, our intent at the time was to capture the data and processing behind the plot in the paper in order to satisfy our desire for reproducibility.


⌨️ Activity: Assign yourself to this issue to get started.


I'll sit patiently until you've assigned yourself to this one.

Overview of data science pipelines II

Welcome to the second installment of "introduction to data pipelines" at USGS, @wdwatkins!! ✨

We're assuming you were able to navigate through the intro-to-targets-pipelines course and that you learned a few things about organizing your code for readability, re-use, and collaboration. You were also introduced to two key concepts through the makefile: a way to program connections between functions, files, and phases, and a dependency manager that skips parts of the workflow that don't need to be re-run.


Recap of pipelines I

First, a recap of key concepts that came from intro-to-targets-pipelines 👇

  • Data science work should be organized thoughtfully. As Jenny Bryan notes, "File organization and naming are powerful weapons against chaos".
  • Capture all of the critical phases of project work with descriptive directories and function names, including how you "got" the data.
  • Turn your scripts into a collection of functions, and modify your thinking to connect outputs from these functions ("targets") to generate your final product.
  • "Skip the work you don't need" by taking advantage of a dependency manager. There was a video that covered a bit of make, and you were asked to experiment with targets.
  • Invest in efficient reproducibility to scale up projects with confidence.

This last concept was not addressed directly, but we hope that the small exercise of seeing rebuilds in action got you thinking about projects that might have much lengthier steps (e.g., several downloads or geo-processing tasks that take hours instead of seconds).

What's ahead in pipelines II

In this training, the focus will be on conventions and best practices for making better, smarter pipelines for USGS Data Science projects. You'll learn new things here that will help you refine your knowledge from the first class and put it into practice. Let's get started!

⌨️ Activity: Add collaborators and close this issue to get started.

As with pipelines I, please invite a few collaborators to your repository so they can easily comment and review in the future. In the ⚙️ Settings widget at the top of your repo, select "Manage access" (or use this shortcut link). Go ahead and invite your course instructor. It should look something like this:
(Screenshot: "add some friends", inviting a collaborator via Manage access)

💡 Tip: Throughout this course, I, the Learning Lab Bot, will reply and direct you to the next step each time you complete an activity. But sometimes I'm too fast when I ⏳ give you a reply, and occasionally you'll need to refresh the current GitHub page to see it. Please be patient, and let my human (your designated course instructor) know if I seem to have become completely stuck.


I'll sit patiently until you've closed the issue.

Local setup

Set up your local environment before continuing

During the course, we will ask you to build the pipeline, explore how to troubleshoot, and implement some of the best practices you are learning. To do this, you will work with the pipeline locally and commit/push your changes to GitHub for review.

See details below for how to get started working with the code and files that exist within the course repository:


Open a git bash shell (Windows 💠) or a terminal window (Mac 🍏) and change (cd) into the directory you work in for projects in R (for me, this is ~/Documents/R). There, clone the repository and set your working directory to the new project folder that was created:

git clone git@github.com:wdwatkins/ds-pipelines-targets-2.git
cd ds-pipelines-targets-2

You can also open this project in RStudio by double-clicking the .Rproj file in the ds-pipelines-targets-2 directory.

⭐ Now you have the repository locally! Follow along with the commands introduced and make changes to the code as requested throughout the remainder of the course.


⌨️ close this issue to continue!

What's next

You are awesome, @wdwatkins! 🌟 💥 🐠


We hope you've learned a lot in intro to pipelines II. We don't have additional exercises in this module, but we'd love to have a discussion if you have questions.

As a resource for later, here are links to the content you just completed

⌨️ You have now completed the course 🎉 If you have comments or questions, add them below and then assign this issue to a course lead to engage in a dialogue.

Learn the differences between different types of targets

Targets

"Targets" are the main things that the targets package interacts with (if the name hadn't already implied that πŸ€ͺ). They represent things that are made (they're also the vertices of the dependency graph). If you want to make a plot called plot.pdf, then that's a target. If you depend on a dataset called data.csv, that's a target (even if it already exists).

In targets, there are two main types:

  • files: These are targets that need format = "file" added as an argument to tar_target(), and their command must return the filepath(s). We have learned that file targets can be a single file, a vector of filepaths, or a directory. USGS Data Science workflows name file targets using the file's base name plus its extension, e.g., the target for "1_fetch/out/data.csv" would be data_csv. If the file name is really long, you can simplify it for the target name, but it is important to include _[extension] as a suffix. Additionally, USGS Data Science pipelines include the filenames created by file targets as typed-out arguments in the target recipe, or in a comment in the target definition. This practice ensures that you and your colleagues only have to read the makefile, not the function code, to learn what file is being created.
  • objects: These are R objects that represent intermediate objects in an analysis. Behind the scenes, these objects are stored to disk so that they persist across R sessions. Unlike typical R objects, they do not exist in your workspace unless you explicitly load them (run tar_load(target_name)). A short sketch illustrating both types follows this list.
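Here is a minimal sketch of both types side by side, following the naming convention above. The download_data() and summarize_data() functions are hypothetical stand-ins, not functions from targets or from your repository:

# File target: named for the file it creates ("1_fetch/out/data.csv" -> data_csv);
# the command writes the file and returns its path
tar_target(
  data_csv,
  download_data(out_file = "1_fetch/out/data.csv"),
  format = "file"
),

# Object target: the command returns an R object that targets stores behind the scenes
tar_target(
  data_summary,
  summarize_data(data_csv)
)

After tar_make() runs, data_summary will not appear in your workspace until you call tar_load(data_summary), and tar_read(data_csv) will return the stored filepath for the file target.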

⌨️ Activity: Assign yourself to this issue to get started.


I'll sit patiently until you've assigned yourself to this one.
