wdwatkins / ds-pipelines-targets-2
Home Page: https://lab.github.com/USGS-R/usgs-targets-tips-and-tricks
In this section, we're going to go one by one through a series of tips that will help you avoid common pitfalls in pipelines. These tips will help you in the next sections and in future work. A quick list of what's to come:
- Using `tar_visnetwork()` and `tar_outdated()` to further interrogate the status of pipeline targets

⌨️ Add a comment to this issue and the bot will respond with the next topic.
So far you've learned a lot about the mechanics of using targets, but there are also a few conventions that USGS Data Science practitioners use to maintain consistency across projects. These conventions make it easier to jump into a new project, provide peer review, or learn a new technique from someone else's pipeline, since you are already familiar with the structure.
As you learned in the first pipelines course, we like to separate pipelines into distinct phases based on what is happening to the data (we usually use `fetch`, `process`, `visualize`, etc.). So far in this course, we have been using a single list of targets in the `_targets.R` makefile. This works for short pipelines, but when you have bigger, more complex pipelines, that file and target list could get HUGE and difficult to read.
For this reason, we like to have multiple makefiles that each describe and are named after a single phase, e.g., `1_fetch.R` or `4_visualize.R`. Within each phase makefile, targets are saved in an R list object which is numbered based on the phase, e.g., `p1_targets_list` or `p4_targets_list`. Then, the main `_targets.R` makefile sources each of these phase makefiles and combines the target lists into a single list using `c()`, e.g., `c(p1_targets_list, p2_targets_list, p3_targets_list, p4_targets_list)`.
In addition to this multi-makefile approach, we also like to name our targets to make it clear which phase they belong to. For example, any target created in the `fetch` phase would be prefixed with `p1_`. We do this for two reasons: 1) it is clearer, and 2) you can now use `dplyr::select` syntax to build all targets in a single phase by running `tar_make(starts_with("p1"))`. A handy little trick!
Consider the two-phased pipeline below, where you need to download data from ScienceBase and then combine it all into a single dataframe.
If the `1_fetch.R` makefile looked like this

```r
p1_targets_list <- list(
  tar_target(
    p1_sb_files,
    {
      # change this dummy date to manually force a re-download
      dummy <- '2021-04-19'
      item_file_download(sb_id = "4f4e4acae4b07f02db67d22b",
                         dest_dir = "1_fetch/tmp",
                         overwrite_file = TRUE)
    },
    format = "file",
    packages = "sbtools"
  )
)
```
and the `2_process.R` makefile looked like this

```r
source("2_process/src/combine_files.R")

p2_targets_list <- list(
  tar_target(
    p2_plot_data,
    combine_into_df(p1_sb_files)
  )
)
```
then the canonical `_targets.R` makefile would look like this

```r
library(targets)

source("1_fetch.R")
source("2_process.R")

# Return the complete list of targets
c(p1_targets_list, p2_targets_list)
```
You could then build the full pipeline by running `tar_make()`, or run specific phases using `tar_make(starts_with("p1"))` for the fetch phase and `tar_make(starts_with("p2"))` for the process phase.
⌨️ Activity: Split your pipeline targets into the phases fetch, process, and visualize. Use a different makefile for each phase and follow our phase-naming conventions to name the makefiles and list objects. Also, rename your targets using the appropriate prefix (`p1_`, `p2_`, `p3_`). Run `tar_make()` and open a pull request. Paste your build status as a comment to the PR and assign your designated course instructor as a reviewer.
You should now have a working pipeline that can run with `tar_make()`. Your current pipeline likely only has one file target, which is the final plot.
We want you to get used to exchanging objects for files and vice versa, in order to expose some of the important differences that show up in the makefile and also in the way the functions are put together.
⌨️ Activity: Open a PR where you swap two object targets to be file targets, and change one file target to be an object target. Run `tar_make()` and open a pull request. Paste your build status as a comment to the PR and assign your designated course instructor as a reviewer.
⌨️ Activity: Make modifications to the working, but less than ideal, pipeline that exists within your course repository.
Within the course repo you should see only a `_targets.R` and directories with code or placeholder files for each phase. You should be able to run `tar_make()` and build the pipeline, although it may take numerous tries, since some parts of this new workflow are brittle. Some hints to get you started: the `site_data` target is too big, and you should consider splitting it into a target for each site, perhaps using the `download_nwis_site_data()` function directly to write a file. Several of the `site_data_` targets are too small and it might make sense to combine them.
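For example, a per-site file target might look something like the sketch below. The site number and the exact signature of `download_nwis_site_data()` are illustrative guesses; check the function definition in your course repo for the real arguments:

```r
# Hypothetical sketch: one file target per site, where the function
# writes the file and the target returns its path
tar_target(
  p1_site_data_01427207_csv,
  download_nwis_site_data("1_fetch/out/site_data_01427207.csv",
                          site_num = "01427207"),
  format = "file"
)
```
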
When you are happy with your newer, better workflow, create a pull request with your changes and assign your designated course instructor as a reviewer. Add a comment to your own PR with thoughts on how you approached the task, as well as key decisions you made.
You should create a local branch called "refactor-targets" and push that branch up to the "remote" location (which is the GitHub host of your repository). We're naming this branch "refactor-targets" to represent concepts in this section of the lab. In the future you'll probably choose branch names according to the type of work they contain - for example, "pull-oxygen-data" or "fix-issue-17".

```shell
git checkout -b refactor-targets
git push -u origin refactor-targets
```
We've covered a lot of content about the rules of writing good pipelines, but pipelines are also very flexible! Pipelines can have as many or as few targets as you would like, and targets can be as big or as small as you would like. The key theme for all pipelines is that they are reproducible codebases to document your data analysis process for both humans and machines. In this next section, we will learn about how to make decisions related to the number and types of targets you add to a pipeline.
Isn't it satisfying to work through a fairly lengthy data workflow and then return to the project and it just works? For the past few years, we have been capturing the steps that go into creating results, figures, or tables appearing in data visualizations or research papers. There are recipes for reproducibility used in complex, collaborative modeling projects, such as in this reservoir temperature modeling pipeline and in this pipeline to manage downloads of forecasted meteorological driver data. Note that you need to be able to access internal USGS websites to see these examples, and that these were developed early in the Data Science adoption of `targets`, so they may not showcase all of our adopted best practices.
Here is a much simpler example that was used to generate Figure 1 from Water quality data for national-scale aquatic research: The Water Quality Portal (published in 2017):
```r
library(targets)

## All R files that are used must be listed here:
source("R/get_mutate_HUC8s.R")
source("R/get_wqp_data.R")
source("R/plot_huc_panel.R")

tar_option_set(packages = c("dataRetrieval", "dplyr", "httr", "lubridate", "maps",
                            "maptools", "RColorBrewer", "rgeos", "rgdal", "sp", "yaml"))

# Load configuration files
p0_targets_list <- list(
  tar_target(map_config_yml, "configs/mapping.yml", format = "file"),
  tar_target(map_config, yaml.load_file(map_config_yml)),
  tar_target(wqp_config_yml, "configs/wqp_params.yml", format = "file"),
  tar_target(wqp_config, yaml.load_file(wqp_config_yml))
)

# Fetch data
p1_targets_list <- list(
  tar_target(huc_map, get_mutate_HUC8s(map_config)),
  tar_target(phosphorus_lakes, get_wqp_data("phosphorus_lakes", wqp_config, map_config)),
  tar_target(phosphorus_all, get_wqp_data("phosphorus_all", wqp_config, map_config)),
  tar_target(nitrogen_lakes, get_wqp_data("nitrogen_lakes", wqp_config, map_config)),
  tar_target(nitrogen_all, get_wqp_data("nitrogen_all", wqp_config, map_config)),
  tar_target(arsenic_lakes, get_wqp_data("arsenic_lakes", wqp_config, map_config)),
  tar_target(arsenic_all, get_wqp_data("arsenic_all", wqp_config, map_config)),
  tar_target(temperature_lakes, get_wqp_data("temperature_lakes", wqp_config, map_config)),
  tar_target(temperature_all, get_wqp_data("temperature_all", wqp_config, map_config)),
  tar_target(secchi_lakes, get_wqp_data("secchi_lakes", wqp_config, map_config)),
  tar_target(secchi_all, get_wqp_data("secchi_all", wqp_config, map_config))
)

# Summarize the data in a plot
p2_targets_list <- list(
  tar_target(
    multi_panel_constituents_png,
    plot_huc_panel(
      "figures/multi_panel_constituents.png", huc_map, map_config,
      arsenic_lakes, arsenic_all, nitrogen_lakes, nitrogen_all,
      phosphorus_lakes, phosphorus_all, secchi_lakes, secchi_all,
      temperature_lakes, temperature_all
    ),
    format = "file"
  )
)

# Combine all targets into a single list
c(p0_targets_list, p1_targets_list, p2_targets_list)
```
This makefile recipe generates a multipanel map, which colors HUC8 watersheds according to how many sites within the watershed have data for various water quality constituents:
The `"figures/multi_panel_constituents.png"` figure takes a while to plot, so it is a somewhat "expensive" target to iterate on when it comes to style, size, colors, and layout (it takes 3 minutes to plot for me). But the plotting expense is dwarfed by the amount of time it takes to build each water quality data "object target", since `get_wqp_data` uses a web service that queries a large database and returns a result; the process of fetching the data can sometimes take over thirty minutes (`nitrogen_all` is a target that contains the locations of all of the sites that have nitrogen water quality data samples).

Alternatively, the `map_config` object above builds in a fraction of a second, and contains some simple information that is used to fetch and process the proper boundaries with the `get_mutate_HUC8s` function, and includes some plotting details for the final map (such as plotting color divisions).
This example, although dated, represents a real project that caused us to think carefully about how many targets we use in a recipe and how complex their underlying functions are. Decisions related to targets are often motivated by the intent of the pipeline. In the case above, our intent at the time was to capture the data and processing behind the plot in the paper in order to satisfy our desire for reproducibility.
⌨️ Activity: Assign yourself to this issue to get started.
Welcome to the second installment of "introduction to data pipelines" at USGS, @wdwatkins! ✨
We're assuming you were able to navigate through the intro-to-targets-pipelines course and that you learned a few things about organizing your code for readability, re-use, and collaboration. You were also introduced to two key things through the makefile: a way to program connections between functions, files, and phases and the concept of a dependency manager that skips parts of the workflow that don't need to be re-run.
First, a recap of key concepts that came from intro-to-targets-pipelines: dependency management in the style of `make`, and you were asked to experiment with `targets`. This last concept was not addressed directly, but we hope that the small exercise of seeing rebuilds in action got you thinking about projects that might have much lengthier steps (e.g., several downloads or geo-processing tasks that take hours instead of seconds).
In this training, the focus will be on conventions and best practices for making better, smarter pipelines for USGS Data Science projects. You'll learn new things here that will help you refine your knowledge from the first class and put it into practice. Let's get started!
⌨️ Activity: Add collaborators and close this issue to get started.
As with pipelines I, please invite a few collaborators to your repository so they can easily comment and review in the future. In the ⚙️ Settings widget at the top of your repo, select "Manage access" (or use this shortcut link). Go ahead and invite your course instructor. It should look something like this:
💡 Tip: Throughout this course, I, the Learning Lab Bot, will reply and direct you to the next step each time you complete an activity. But sometimes I'm too fast when I ⏳ give you a reply, and occasionally you'll need to refresh the current GitHub page to see it. Please be patient, and let my human (your designated course instructor) know if I seem to have become completely stuck.
During the course, we will ask you to build the pipeline, explore how to troubleshoot, and implement some of the best practices you are learning. To do this, you will work with the pipeline locally and commit/push your changes to GitHub for review.

See details below for how to get started working with code and files that exist within the course repository:
Open a git bash shell (Windows 💠) or a terminal window (Mac 🍏) and change (`cd`) into the directory you work in for projects in R (for me, this is `~/Documents/R`). There, clone the repository and set your working directory to the new project folder that was created:

```shell
git clone git@github.com:wdwatkins/ds-pipelines-targets-2.git
cd ds-pipelines-targets-2
```
You can also open this project in RStudio by double-clicking the .Rproj file in the `ds-pipelines-targets-2` directory.
✅ Now you have the repository locally! Follow along with the commands introduced and make changes to the code as requested throughout the remainder of the course.
⌨️ Close this issue to continue!
You are awesome, @wdwatkins! 🎉
We hope you've learned a lot in intro to pipelines II. We don't have additional exercises in this module, but we'd love to have a discussion if you have questions.
As a resource for later, here are links to the content you just completed:

- Using `tar_visnetwork()` and `tar_outdated()` to further interrogate the status of pipeline targets
- `.gitignore` and commenting pipeline code

⌨️ You have now completed the course 🎉 If you have comments or questions, add them below and then assign a course lead this issue to engage in dialogue.
"Targets" are the main things that the `targets` package interacts with (if the name hadn't already implied that 🤪). They represent things that are made (they're also the vertices of the dependency graph). If you want to make a plot called `plot.pdf`, then that's a target. If you depend on a dataset called `data.csv`, that's a target (even if it already exists).
In `targets`, there are two main types:

- **File targets** have `format = "file"` added as an argument to `tar_target()` and their command must return the filepath(s). We have learned that file targets can be single files, a vector of filepaths, or a directory. USGS Data Science workflows name file targets using their base name and their file extension, e.g., the target for `"1_fetch/out/data.csv"` would be `data_csv`. If the file name is really long, you can always simplify it for the target name, but it is important to include `_[extension]` as a suffix. Additionally, USGS Data Science pipelines include the filenames created by file targets as typed-out arguments in the target recipe, or in a comment in the target definition. This practice ensures that you and your colleagues will only have to read the makefile, not the function code, to learn what file is being created.
- **Object targets** are R objects stored in the `_targets/` data store, which you can load into your R session on demand (e.g., with `tar_load(target_name)`).
⌨️ Activity: Assign yourself to this issue to get started.