
bike_predict's Introduction

Bike Predict End-to-End Machine Learning Pipeline

This repository contains an example of using Posit Connect, pins, and vetiver to create an end-to-end machine learning pipeline.

Who This is For

Both data scientists and R admins in machine-learning-heavy contexts may find this demo interesting. People who describe productionizing or deploying content as a pain point may find this helpful.

Some particular pain points this could address:

I am trying to deploy/productionize a machine learning model

People mean MANY different things by "productionize" a machine learning model. Very often, that means making the output of a model available to another process. The most common paths to making model output accessible to other tools are writing to a database, writing to a flat file (or pin), or providing real-time predictions with a plumber API.

This repository contains examples of all three of these patterns. The model metrics script outputs the test data, including predictions, to a database and outputs model performance metrics to a pin. It would be easy to make either of these the final consumable for another process, like the shiny client app. The shiny app in this repository consumes predictions from a plumber API.
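
As a rough illustration of the first two patterns, here is a hedged sketch; the DSN, table name, and object names are illustrative placeholders, not the repository's actual code:

library(DBI)
library(pins)

# Pattern 1: write test data plus predictions to a database table.
con <- DBI::dbConnect(odbc::odbc(), dsn = "ContentDB")
DBI::dbWriteTable(con, "bike_predictions", predictions, overwrite = TRUE)

# Pattern 2: write model performance metrics to a pin on Connect.
board <- pins::board_connect()
pins::pin_write(board, metrics, name = "bike_model_metrics")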

Another common problem in deploying a model is figuring out where the model lives. In this example, the models are pinned to Posit Connect and consumed from the pin by the other assets (the test data script and the plumber API).
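
For instance, a consuming asset might read the pinned model back and serve it; this is a hedged sketch, and the pin name is illustrative:

library(pins)
library(vetiver)

board <- board_connect()
v <- vetiver_pin_read(board, "user.name/bike_predictions")

# The plumber API can then serve predictions from the pinned model.
pr <- plumber::pr()
pr <- vetiver_api(pr, v)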

For relatively advanced model deployments, users may be interested in horse-racing different models, A/B testing one model against another, or monitoring model performance and drift over time. Once finished, the model performance dashboard will be a tool to compare different models and examine model performance over time.

Another piece embedded in the background of deploying/productionizing a model is making the entire pipeline robust to, for example, someone accidentally pushing the deploy button when they shouldn't. A robust solution to this is programmatic deployment. This entire repository is deployed from a GitHub repo, using the git-backed deployment functionality in Posit Connect. One nice problem this can solve is deploying dev versions of content, which can be accomplished with a long-running deployed dev branch. There's an example of this in the Dev Client App.

Another piece of this is making the underlying R functions more robust. See the next point for more on that.

I have a bunch of functions I need to use, but it's a pain

Most R users know that the correct solution is to put their R functions into a package if they are reused -- or even if they need to be well-documented and tested. This repository includes a package of helper functions that do various tasks.

Many R users aren't sure how to deploy their package: it works well locally, but everything breaks when they try to deploy. This is a great use case for Posit Package Manager, which makes it easy to put the code an app needs into a package, push that code up to git, and have it available via install.packages() in a deployment environment (like Posit Connect).
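
On the deployment side, installing an internal package from a Package Manager instance might look roughly like this; the repository URLs below are illustrative placeholders:

# Point install.packages() at a Package Manager repository.
options(repos = c(
  internal = "https://packagemanager.example.com/internal/latest",
  CRAN = "https://packagemanager.example.com/cran/latest"
))
install.packages("bikeHelpR")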

For more details, see the Posit Package Manager page and https://solutions.posit.co/envs-pkgs/environments/.

I have a bunch of CSV files I use in my shiny app

For some workflows, a CSV file is the best choice for storing data. However, for many (most?) cases, the data would do better if stored somewhere centrally accessible by multiple people where the latest version is always available. This is particularly true if that data is reused across multiple projects or pieces of content.

This project has two data files that are particularly pin-able -- the station metadata file (that maps station IDs to names and locations) and the data frame of out-of-sample error metrics for each model. Both are relatively small files, reused by multiple assets, where only the newest version is needed -- perfect candidates for a pin.

A few other non-dataset objects are also perfect for a pin: the models themselves and the test/training split. These have similar properties to the datasets -- small, reused, and only the newest is needed -- and are serializable by R, making them excellent choices for a pin.

Some examples of objects that are likely to be a good fit for a pin (a minimal pins sketch follows this list):

  • machine-learning models
  • plotting data that is updated on a schedule (as opposed to created on demand)
  • data splits/training data sets
  • metadata files mapping some machine-readable ID to human-readable details
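
A minimal sketch of that write-once, read-everywhere pattern with pins, assuming a recent pins version where board_connect() is available; the object and pin names are illustrative:

library(pins)

# Authenticates via the CONNECT_SERVER and CONNECT_API_KEY environment variables.
board <- board_connect()

# Publish the newest version of a small, reused object.
pin_write(board, station_info, name = "bike_station_info")

# Any other piece of content can then read the latest version.
station_info <- pin_read(board, "user.name/bike_station_info")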

I've got this CRON job that does some ETL/data processing/creates a bunch of files

Scheduled R Markdown isn't always the best solution here (for example, robust SQL pipelines in another tool don't need to be replaced with scheduled R Markdown), but if the user is running R code, scheduled R Markdown is far easier than anything else.

What it doesn't do

This repository shows an exciting set of capabilities, combining open-source R and Python with Posit's professional products. There are a few things it doesn't do (yet) -- but that I might add, depending on interest:

  • Jobs don't depend on one another. I've scheduled the jobs so that each completes before the next starts, but there are tools in R (like drake) that let you put the entire pipeline into code and make dependencies explicit.
  • Pieces of content must be managed individually, including uploading, permissions, environment variables, and tagging. It is possible to do something more robust via programmatic deployment using the Posit Connect API (see the sketch below), but generic git deployment doesn't support deploying all of the content in a git repo at once.
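
A hedged sketch of that programmatic-deployment approach with the connectapi package; the server URL and deployment name are illustrative:

library(connectapi)

client <- connect(
  server = "https://connect.example.com",
  api_key = Sys.getenv("CONNECT_API_KEY")
)

# Bundle one piece of content and deploy it via the Connect API.
bundle <- bundle_dir("content/01-etl/01-raw-data-refresh")
deploy(client, bundle, name = "raw-data-refresh")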

Individual Content

Each piece of content is listed below with its description, code location, and what is deployed to Connect.

ETL Step 1 - Raw Data Refresh
  Description: Get the latest station status data from the https://capitalbikeshare.com API. The data is written to Content DB in the bike_raw_data and bike_station_info tables.
  Code: content/01-etl/01-raw-data-refresh/document.qmd
  Deployed to Connect: Quarto document, Pin (bike_station_info)

ETL Step 2 - Tidy Data
  Description: From Content DB, get two tables: (1) bike_raw_data and (2) bike_station_info. The two data sets are tidied and then combined. The resulting tidy data set is written to Content DB in the bike_model_data table.
  Code: content/01-etl/02-tidy-data/document.qmd
  Deployed to Connect: Quarto document

Model Step 1 - Train and Deploy Model
  Description: From Content DB, get the bike_model_data table and train a model. The model is saved to Connect as a pin, and then deployed to Connect as a plumber API using vetiver.
  Code: content/02-model/01-train-and-deploy-model/document.qmd
  Deployed to Connect: Quarto document, Pin, Plumber API

Model Step 2 - Model Card
  Description: Use the vetiver model card template to document essential facts and considerations of the deployed model.
  Code: content/02-model/03-model-card/document.qmd
  Deployed to Connect: Quarto document

Model Step 3 - Model Metrics
  Description: Use vetiver to document model performance. Performance metrics are calculated and then written to a pin using vetiver.
  Code: content/02-model/02-model-metrics/document.qmd
  Deployed to Connect: Quarto document, Pin

App - Client App
  Description: Use the API endpoint to serve predictions interactively in a shiny app.
  Code: content/03-app/01-client-app/app.R
  Deployed to Connect: Shiny app

App - Dev Client App
  Description: A development version of the client app.
  Code: content/03-app/03-client-app-dev/app.R
  Deployed to Connect: Shiny app

App - Content Dashboard
  Description: A dashboard that contains links to all of the bike predict content.
  Code: content/03-app/02-connect-widgets-app/document.qmd
  Deployed to Connect: Quarto document
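
A hedged sketch of the train-and-deploy pattern from Model Step 1; the model specification, variable names, and pin name are illustrative, not the repository's actual code:

library(parsnip)
library(pins)
library(vetiver)

# Train a simple model on the tidied data (illustrative formula).
model_fit <- fit(linear_reg(), n_bikes ~ hour + dow, data = train_data)

# Pin the versioned model to Connect...
v <- vetiver_model(model_fit, model_name = "bike_predictions")
board <- board_connect()
vetiver_pin_write(board, v)

# ...then deploy a plumber API that serves predictions from that pin.
vetiver_deploy_rsconnect(board, "user.name/bike_predictions")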

Contributing

See a problem or want to contribute? Please refer to the contributing page.

bike_predict's People

Contributors

akgold, edavidaja, gsingh91, kellobri, kmasiello, npelikan, samedwardes, xuf12


bike_predict's Issues

ETL Step 01 - Update the Database is failing

For some reason the first ETL job is failing on Connect. It is working on Workbench. Here are the logs:

05/06 18:41:10.973 (GMT) Data access failure: Error in loadNamespace(x): there is no package called 'httr'
05/06 18:41:10.973 (GMT) Quitting from lines 28-53 (update_db.qmd)
05/06 18:41:10.974 (GMT) Error in UseMethod("mutate") :
05/06 18:41:10.974 (GMT)   no applicable method for 'mutate' applied to an object of class "NULL"
05/06 18:41:10.974 (GMT) Calls: local ... withVisible -> eval -> eval -> %>% -> <Anonymous> -> <Anonymous>
05/06 18:41:10.974 (GMT) In addition: Warning message:
05/06 18:41:10.974 (GMT)   In system("whoami", intern = TRUE) : running command 'whoami' had status 1
05/06 18:41:10.977 (GMT) Execution halted

The issue looks to be around here:

feeds_data <- bikeHelpR::feeds_urls() %>%
  filter(name == "station_status") %>%
  pull(url) %>%
  bikeHelpR::get_data()

For some reason this is returning NULL.

Integration of {targets}

This issue is to keep track of the integration of the {targets} package to the different components of the pipeline in this repository.

Rename files in bike_predict to include number_prefix

@akgold I am considering renaming the files in bike_predict to be a bit easier to follow (in my opinion). For example, the current file tree looks like this:

├── bike_predict
│   ├── API
│   │   ├── manifest.json
│   │   ├── plumber.R
│   │   └── rsconnect
│   ├── App
│   │   ├── client_app
│   │   └── model_performance
│   ├── bike_predict.Rproj
│   ├── Deploy
│   │   ├── deploy.Rmd
│   │   ├── Deploy.Rproj
│   │   ├── renv
│   │   ├── renv.lock
│   │   └── rsconnect
│   ├── EDA
│   │   ├── EDA.Rmd
│   │   ├── model.RDS
│   │   └── renv.lock
....

I propose to rename things to look like this:

├── README.md
├── content
│   ├── 01-etl-raw-data-ingest
│   │   ├── ETL_clean_raw.qmd
│   │   └── manifest.json
│   ├── 02-etl-clean-data
│   │   ├── ETL_clean_raw.qmd
│   │   └── manifest.json

And so on...

I think this will make it a bit easier to follow. Right now it is hard to see how things fit together unless you look at the raw README file.

Just wanted your buy-in before making any changes.

Colorado is not building quarto docs

Thank you to @jthomasmock for discovering this issue (#15 (comment))! The https://github.com/sol-eng/bike_predict/blob/main/_write_manifest.qmd script is not correctly writing the manifest as a quarto doc. Instead, the content is being identified as an R Markdown doc.

To fix this, we need to update rsconnect to v0.8.26 and, for each deployed piece of content, change this:

rsconnect::writeManifest(
  appDir = "content/01-etl/01-raw-data-refresh", 
  appFiles = "document.qmd"
)
print("Complete 🎉")

To this:

rsconnect::writeManifest(
  appDir = "content/01-etl/01-raw-data-refresh", 
  appPrimaryDoc = "document.qmd",
  quarto = quarto::quarto_path()
)

#25 has been drafted to fix this; however, it will not work until RSC on Colorado is updated to 2022.05.0.

Move `master` branch to `main`

The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.

That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.

The purpose of this issue is to:

  • Help us firm up the list of targeted repositories
  • Make sure all maintainers are aware of what's coming
  • Give us an issue to close when the job is done
  • Give us a place to put advice for collaborators re: how to adapt

message id: euphoric_snowdog

Failed runs after migrating to us-east-2 and colorado.posit.co

├── content
│   ├── 01-etl
│   │   ├── 01-raw-data-refresh
│   │   └── 02-tidy-data
│   ├── 02-model
│   │   ├── 01-train-and-deploy-model
│   │   ├── 02-model-card
│   │   └── 03-model-metrics
│   └── 03-app
│       ├── 01-client-app
│       ├── 02-connect-widgets-app
│       └── 03-client-app-dev

content/01-etl/01-raw-data-refresh

  • Confirmed working, no issues.

content/01-etl/

  • Confirmed working, no issues.

content/02-model/01-train-and-deploy-model

  • Fixed issue where code was pointing to colorado.rstudio.com instead of colorado.posit.co

content/02-model/02-model-card

  • Failed several times, but working as of 2023-05-01
  • Failed run on April 24:
2023/04/24 1:30:39 AM: [rsc-session] Content GUID: db334c42-9c7c-4102-8127-71c2ea82fba6
2023/04/24 1:30:39 AM: [rsc-session] Content ID: 11821
2023/04/24 1:30:39 AM: [rsc-session] Bundle ID: 58430
2023/04/24 1:30:39 AM: [rsc-session] Variant ID: 5493
2023/04/24 1:30:39 AM: [rsc-quarto] Running on host: render-quarto-project-6rgld-qj2cb
2023/04/24 1:30:39 AM: [rsc-quarto] Linux distribution: Ubuntu 18.04.6 LTS (bionic)
2023/04/24 1:30:39 AM: [rsc-quarto] Running as user: uid=999 gid=999 groups=999
2023/04/24 1:30:39 AM: [rsc-quarto] Connect version: 2023.03.0
2023/04/24 1:30:39 AM: [rsc-quarto] LANG: en_US.UTF-8
2023/04/24 1:30:39 AM: [rsc-quarto] Working directory: /opt/rstudio-connect/mnt/app
2023/04/24 1:30:39 AM: [rsc-quarto] Using R binary: /opt/R/4.1.3/bin/R
2023/04/24 1:30:40 AM: [rsc-quarto] Configuring PATH='/opt/R/4.1.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
2023/04/24 1:30:40 AM: [rsc-quarto] Configuring R_HOME=/opt/R/4.1.3/lib/R
2023/04/24 1:30:40 AM: [rsc-quarto] Running content using its packrat R library
2023/04/24 1:30:42 AM: [rsc-quarto] Configuring R_LIBS='/opt/rstudio-connect/mnt/app/packrat/lib/x86_64-pc-linux-gnu/4.1.3'
2023/04/24 1:30:44 AM: [rsc-quarto] Running 'quarto render'
2023/04/24 1:30:50 AM:
2023/04/24 1:30:50 AM:
2023/04/24 1:30:50 AM: processing file: document.qmd
2023/04/24 1:30:50 AM: | | | 0% | |.... | 6%
2023/04/24 1:30:50 AM: inline R code fragments
2023/04/24 1:30:50 AM:
2023/04/24 1:30:50 AM: | |........ | 12%
2023/04/24 1:30:50 AM: label: setup (with options)
2023/04/24 1:30:50 AM: List of 4
2023/04/24 1:30:50 AM: $ collapse : logi TRUE
2023/04/24 1:30:50 AM: $ original.params.src: chr "r setup"
2023/04/24 1:30:50 AM: $ chunk.echo : logi FALSE
2023/04/24 1:30:50 AM: $ yaml.code : chr [1:2] "#| collapse: true" ""
2023/04/24 1:30:50 AM:
2023/04/24 1:31:04 AM: Quitting from lines 20-59 (document.qmd)
2023/04/24 1:31:04 AM:
2023/04/24 1:31:04 AM: Error in new_result(connection@ptr, statement, immediate) :
2023/04/24 1:31:04 AM: nanodbc/nanodbc.cpp:1412: 00000: [RStudio][PostgreSQL] (30) Error occurred while trying to execute a query: [SQLState 42P01] ERROR: relation "bike_model_data" does not exist
2023/04/24 1:31:04 AM: LINE 2: FROM "bike_model_data" AS "q01"
2023/04/24 1:31:04 AM: ^
2023/04/24 1:31:04 AM:
2023/04/24 1:31:04 AM: Calls: .main ... dbSendQuery -> dbSendQuery -> .local -> OdbcResult -> new_result
2023/04/24 1:31:04 AM: Execution halted

content/02-model/03-model-metrics

  • Reran 2023-05-02 with no issues.
  • Failed logs on 2023-05-01:
2023/05/01 08:31:18.474444615 [rsc-session] Content GUID: e2c4d2ce-8ad7-4e10-9e57-4f9a07677141
2023/05/01 08:31:18.475026367 [rsc-session] Content ID: 11822
2023/05/01 08:31:18.475045229 [rsc-session] Bundle ID: 58431
2023/05/01 08:31:18.475050805 [rsc-session] Variant ID: 5494
2023/05/01 08:31:18.480162638 [rsc-quarto] Running on host: render-quarto-project-nppx6-99ldz
2023/05/01 08:31:18.563318390 [rsc-quarto] Linux distribution: Ubuntu 18.04.6 LTS (bionic)
2023/05/01 08:31:18.572971520 [rsc-quarto] Running as user: uid=999 gid=999 groups=999
2023/05/01 08:31:18.572988608 [rsc-quarto] Connect version: 2023.03.0
2023/05/01 08:31:18.573429776 [rsc-quarto] LANG: en_US.UTF-8
2023/05/01 08:31:18.573433836 [rsc-quarto] Working directory: /opt/rstudio-connect/mnt/app
2023/05/01 08:31:18.573589277 [rsc-quarto] Using R binary: /opt/R/4.1.3/bin/R
2023/05/01 08:31:19.114913689 [rsc-quarto] Configuring PATH='/opt/R/4.1.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
2023/05/01 08:31:19.143705950 [rsc-quarto] Configuring R_HOME=/opt/R/4.1.3/lib/R
2023/05/01 08:31:19.143725143 [rsc-quarto] Running content using its packrat R library
2023/05/01 08:31:19.673271295 [rsc-quarto] Configuring R_LIBS='/opt/rstudio-connect/mnt/app/packrat/lib/x86_64-pc-linux-gnu/4.1.3'
2023/05/01 08:31:21.245058748 [rsc-quarto] Running 'quarto render'
2023/05/01 08:31:23.951223256 
2023/05/01 08:31:23.951240641 
2023/05/01 08:31:23.951295405 processing file: document.qmd
2023/05/01 08:31:24.254754441 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |..........                                                            |  14%
2023/05/01 08:31:24.254788725    inline R code fragments
2023/05/01 08:31:24.254845346 
2023/05/01 08:31:24.524199808 
  |                                                                            
  |....................                                                  |  29%
2023/05/01 08:31:24.525757327 label: setup (with options) 
2023/05/01 08:31:24.529604405 List of 4
2023/05/01 08:31:24.531882291  $ collapse           : logi TRUE
2023/05/01 08:31:24.532732079  $ original.params.src: chr "r setup"
2023/05/01 08:31:24.533295706  $ chunk.echo         : logi FALSE
2023/05/01 08:31:24.533893152  $ yaml.code          : chr [1:2] "#| collapse: true" ""
2023/05/01 08:31:24.534206258 
2023/05/01 08:31:33.718601785 Quitting from lines 28-60 (document.qmd) 
2023/05/01 08:31:33.718983204 
2023/05/01 08:31:33.719291463 Error in new_result(connection@ptr, statement, immediate) : 
2023/05/01 08:31:33.719298690   nanodbc/nanodbc.cpp:1412: 00000: [RStudio][PostgreSQL] (30) Error occurred while trying to execute a query: [SQLState 42P01] ERROR:  relation "bike_model_data" does not exist
2023/05/01 08:31:33.719329512 LINE 2: FROM "bike_model_data" AS "q01"
2023/05/01 08:31:33.719330845              ^
2023/05/01 08:31:33.719336648  
2023/05/01 08:31:33.719337354 Calls: .main ... dbSendQuery -> dbSendQuery -> .local -> OdbcResult -> new_result
2023/05/01 08:31:33.720654859 Execution halted

content/03-app/01-client-app

  • Fixed issue where code was pointing to colorado.rstudio.com instead of colorado.posit.co

content/03-app/02-connect-widgets-app

  • No issues

content/03-app/03-client-app-dev

  • Fixed issue where code was pointing to colorado.rstudio.com instead of colorado.posit.co

Other

  • Regenerated all of the manifests to point to PPM on colorado.posit.co

Suggestions for tidymodels + vetiver code

Hello! 👋 This demo is looking so great; thank you for creating this to show folks how to use our tools. I have a couple of minor suggestions for tidymodels and vetiver code in 01-train-and-deploy-model:

Can you use initial_time_split() here, instead of the manual splitting? Then you can use training() and testing() from tidymodels:

n_days_test <- 2
n_days_to_train <- 10
train_end_date <- dates[n_days_test + 1]
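
For reference, a hedged sketch of the suggested alternative (assuming bike_model_data is ordered by date; names are illustrative):

library(rsample)

split <- initial_time_split(bike_model_data, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)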

I don't believe you need to manually save the ptype here (or set versioned = TRUE). This should be grabbed automatically from the model_fit:

versioned = TRUE,
save_ptype = train_data %>%
  head(1) %>%
  select(-n_bikes),
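
A hedged sketch of the suggested simplification, letting vetiver infer the prototype from the fitted model (names are illustrative):

v <- vetiver_model(model_fit, model_name = "bike_predictions")
vetiver_pin_write(board, v)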

If you'd like a PR for either of these, I would be happy to do it!

Update RSPM to new bikeHelpR location

If #15 is merged, we may need to update public package manager and colorado package manager to point to the new location of the bikeHelpR package.

Old location:

.
├── bike_predict.Rproj
├── pkg

New location:

.
├── bike_predict.Rproj
├── content
│   ├── 00-r-package
│   │   └── bikeHelpR

Tasks

  • Update public package manager
  • Update colorado package manager

DB connection expiry

I'm not sure how DB connections are managed, but I happened upon this app on Colorado the other day with an expired database connection - I had to poke the environment variables to reset the R processes. Might be worth figuring out if this is a bug upstream / if there is an easy way to reproduce / if there is a way we can be defensive about this in the app.

It made for an unfortunate demo to happen upon the app while it was broken.

Re-brand images to Posit

The images in this repo are still branded as RStudio, not Posit. I think it might be helpful to update them with the new branding, particularly arrows.png as it is very prominent in the README.

Diagram arrows are covering up text in some areas

Love all the great updates to Bike Prediction, thanks for doing this Sam!

I did want to note that for the diagram on ETL/Model Training w/ Quarto - the arrow is covering up some of the text.

I also wasn't sure if there was a missing arrow for the database connection forward?
