
mlops-sustainability-oss 🌱

Disclaimer: This repository focuses on building sound processes and pipelines in CP4D. From an ML perspective, the trained model itself is a placeholder and does not receive the tuning effort a production model would need.

A self-sustaining end-to-end CP4D MLOps workflow for a flood forecasting model with automated data versioning and rollback

Welcome 👋 to this MLOps for sustainability repository.

The Mission: As climate change becomes an increasingly pressing issue, flood risks grow more prevalent. We want to demonstrate how to leverage data from the Copernicus Climate Data Store and build an MLOps pipeline that is self-sustaining in nature. If a model is trained only once - or re-trained only for hyperparameter tuning - its accuracy will decay, leaving a more or less unusable model after a given amount of time. 📉 Once the days for which we predicted flood risks have passed and actual data is available, we retrieve the newest data, retrain the model, and benchmark it against its predecessor to determine which model to keep and deploy.

Consider this repository a modified version (an extended subset) of our MLOps-CPD repository, which is our simplest approach and carries the most rigorous documentation.

The main differences are:

  • Using climate data from Copernicus instead of the German Credit Risk Dataset
  • Self-sustaining approach through a scheduled pipeline system that retrieves the most recent weather data
  • Use of Open Source Software (OSS) for Model and Dataset Versioning via Data Version Control (DVC)

This MLOps accelerator uses IBM Cloud Object Storage (COS) as the remote for DVC (via the S3 API), a scikit-learn model for training, and Watson Machine Learning (WML) for deployment. Our notebook repertoire is easily modified to use a different data store, custom ML models, and other providers for model deployment.

🎥 IBM employees only: Check out the recording of our TechFest23 session for this project.

โš ๏ธ Disclaimer: This model is currently no where near academic-grade quality. We focused primarily on quickly constructing an MLOps workflow that works with sustainability data and DVC.

We reserve the right to continuously fix, improve, and extend this repository. See the todos for information on upcoming features.

Overview

This subsection will describe our data source, datasets, sub-modules, requirements etc. in more detail.

Data Source and Datasets

We use the Copernicus Climate Data Store to retrieve historic and current climate data. We collect the following datasets and variables:

Since we want to predict flooding risks, we want to predict the time and place where extremely high river discharge occurs. Therefore, all variables gathered from ERA5 are used to predict our predictand dis24. All data from the aforementioned datasets comes in either NetCDF or NetCDF4 format, which is easily handled in our notebooks. We make use of Copernicus' "Sub-region extraction" feature so that we retrieve data not for the whole globe, but for a specified sub-region delimited by coordinates for the N, W, S, E boundaries. This allows us to easily build different pipelines (and subsequently deploy different models for different regions). In our example, we set Europe as the specified region in our pipeline parameters.
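The sub-region extraction described above maps to the `area` key of a CDS API request, given as [N, W, S, E]. A minimal sketch of assembling such a request follows; the dataset name, variable name, and the Europe bounding box are illustrative assumptions, not necessarily the exact values used by our pipeline.

```python
# Sketch: building a CDS API request restricted to a sub-region.
# Dataset/variable names and the Europe bounds below are placeholders.

def build_cds_request(north, west, south, east, year, month):
    """Assemble a CDS request dict; 'area' bounds the sub-region as [N, W, S, E]."""
    return {
        "format": "netcdf",
        "variable": ["river_discharge_in_the_last_24_hours"],  # predictand dis24
        "year": str(year),
        "month": f"{month:02d}",
        "area": [north, west, south, east],
    }

# Rough Europe bounding box: N=72, W=-25, S=34, E=45
request = build_cds_request(72, -25, 34, 45, 2023, 6)

# With personal cdsapi credentials configured, a download would then look like:
#   import cdsapi
#   cdsapi.Client().retrieve("cems-glofas-historical", request, "data.nc")
```

Changing only the four boundary coordinates in the pipeline parameters is what lets the same notebooks serve different regional models.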

All of the aforementioned tasks are handled in our notebooks. To recreate the proposed MLOps lifecycle, you will need to create your own Copernicus account to retrieve personalized credentials, since we are not sharing or hard-coding ours for obvious reasons. Retrieve your cdsapi credentials and learn how to use the API here.
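The cdsapi client reads its credentials from a ~/.cdsapirc file. As a convenience, you could generate that file once from your account values; the sketch below assumes the legacy v2 credential format (UID:API-KEY), so check your account page for the exact key string.

```python
# Sketch: writing the credentials file that the cdsapi client reads.
# UID and API key are personal values from your Copernicus account page.
from pathlib import Path

def write_cdsapirc(uid: str, api_key: str, path: Path) -> str:
    """Write a .cdsapirc in the legacy v2 format and return its content."""
    content = (
        "url: https://cds.climate.copernicus.eu/api/v2\n"
        f"key: {uid}:{api_key}\n"
    )
    path.write_text(content)
    return content
```

In normal use `path` would be `Path.home() / ".cdsapirc"`; keep this file out of version control, just like credentials.py.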

Prerequisites on IBM Cloud

In order to use the above asset, you need access to an IBM environment with authentication. Your IBM Cloud account should have access to the following services:

  • IBM Watson Studio
  • IBM Watson Machine Learning (If you are not deploying with a different provider)
  • IBM Cloud Object Storage (If you are not using a different data store)

Please make sure you have appropriate access to all of these services.

The runs are also governed by the amount of capacity unit hours (CUH) you have access to. If you are running on the free plan, please refer to the following links:

Instructions for Project Set-up

General Set-up

For a general tutorial, please refer to the regularly maintained instructions in our main MLOps repository. We cannot update every repository after each change or new feature in Watson Studio etc., which is why we refer you to the main documentation.

There you will find instructions on Watson Studio Project and Deployment Spaces, Cloud Object Storage set-up, Pipeline set-up and more.

Where this project deviates

Addendum: Cloud Object Storage Set-up

When creating Cloud Object Storage credentials, you will need to enable HMAC:

Service Credentials > New Credential > Advanced options: include HMAC credentials ("HMAC": true) and assign the Writer role. We need the access_key_id and secret_access_key from this credential.

Add both keys to your credentials.py file if you are running locally, or make sure both keys are available to the pipeline, e.g.:

  • add them to MLOPS_COS_CREDENTIALS together with all other Cloud Object Storage secrets.
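One way to support both modes is a small loader that prefers the pipeline parameter and falls back to credentials.py. This is a sketch: MLOPS_COS_CREDENTIALS comes from this repo's conventions, but the JSON layout and the variable names in credentials.py are assumptions.

```python
# Sketch: resolve COS HMAC keys from a pipeline parameter or credentials.py.
# The JSON layout and credentials.py attribute names are assumptions.
import json
import os

def load_cos_hmac_credentials():
    """Prefer the MLOPS_COS_CREDENTIALS pipeline parameter, else credentials.py."""
    raw = os.environ.get("MLOPS_COS_CREDENTIALS")
    if raw:
        creds = json.loads(raw)
    else:
        import credentials  # local credentials.py, never committed to git
        creds = {
            "access_key_id": credentials.COS_ACCESS_KEY_ID,
            "secret_access_key": credentials.COS_SECRET_ACCESS_KEY,
        }
    return creds["access_key_id"], creds["secret_access_key"]
```

Keeping a single accessor like this means the notebooks never need to know whether they are running locally or inside a scheduled pipeline.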

Addendum: Pipeline Set-up

Sample Pipeline Layout

DVC Set-up

DVC is a Git-like way to manage large data across systems, and it connects easily with IBM COS to store and distribute versioned data. This section assumes some familiarity with creating resources through the cloud.ibm.com dashboard, as covered in MLOps-CPD.

NOT covered by Notebooks

In order for DVC to be able to actively track datasets or models, you will need to initialize a new empty repository.

Add the information for your repository to credentials.py under the GIT_REPOSITORY environment variable. It must have the following format:
https://USERNAME:TOKEN@github.com/USERNAME/REPOSITORY-NAME

Note: DVC will not store your dataset and model themselves, only placeholder files that track data files and directories. The repository will additionally contain your DVC configuration file, which in turn contains your remote settings (URL, endpoint, and, unless configured otherwise, access secrets in plain text).
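Rather than pasting the token into the URL by hand, the value can be assembled from its parts. A minimal sketch, assuming a GitHub-hosted repository as in the format above:

```python
# Sketch: assemble the GIT_REPOSITORY value from username, token, and repo name.
# GitHub is assumed as the host, matching the format shown above.
def git_repository_url(username: str, token: str, repo: str) -> str:
    return f"https://{username}:{token}@github.com/{username}/{repo}"
```

Because the token ends up embedded in the URL, credentials.py must stay out of version control.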

Covered by Notebook

Within the notebook a1_init_dvc_and_track_data.ipynb:

  • the previously created Git repository is cloned into the temporary filesystem of the CPDaaS Jupyter runtime.
  • DVC is initialized (dvc init)
  • A Cloud Object Storage instance is added to DVC as a remote (using the credentials passed via WS Pipeline parameters) and subsequently committed to the DVC Git repository via git commit and dvc push.
  • The folder structure is created, e.g. /data, /model
  • The full pickle binary of the dataset is added via dvc add, git commit, dvc push
    • Metadata is pushed to the repository
    • The binary is uploaded to the COS bucket
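The notebook steps above can be sketched as an ordered command list. This is a hypothetical outline, not the notebook's actual code: the remote name "cos-remote", the bucket URL, and the commit message are placeholders.

```python
# Sketch of the notebook's shell steps; remote name, bucket URL, and
# commit message are placeholders, not the repository's real values.
def dvc_bootstrap_commands(data_file: str) -> list:
    return [
        ["dvc", "init"],
        ["dvc", "remote", "add", "-d", "cos-remote", "s3://my-bucket/dvcstore"],
        ["dvc", "add", data_file],
        ["git", "add", f"{data_file}.dvc", ".dvc/config", ".gitignore"],
        ["git", "commit", "-m", "Track dataset with DVC"],
        ["dvc", "push"],
    ]

# Each command would run inside the cloned repository, e.g.:
#   import subprocess
#   for cmd in dvc_bootstrap_commands("data/dataset.pkl"):
#       subprocess.run(cmd, check=True)
```

Note how `dvc add` produces a small .dvc metadata file that is committed to Git, while `dvc push` uploads the actual binary to the COS bucket.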

Note: All steps will be repeated by the pipeline as a consequence of running on a schedule. That is of no concern, since redundant cells will be skipped.
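Skipping redundant cells can be done with simple idempotency guards. A minimal sketch of the kind of check involved (the notebook's actual guard logic may differ):

```python
# Sketch: an idempotency guard so scheduled re-runs skip completed steps.
import os

def needs_dvc_init(repo_dir: str) -> bool:
    """True only if the repo has not been initialized for DVC yet."""
    return not os.path.isdir(os.path.join(repo_dir, ".dvc"))
```

A cell then runs `dvc init` only when `needs_dvc_init(...)` is true, so a scheduled re-run of the same notebook is harmless.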

Contributors

  • iiias
  • 5y5tem