Tool for water utilities to monitor and predict customers' risk of service interruption
For details, please read the User Guide
CutoffPredictor consists of:
- Back end
- A machine learning model is trained periodically, via the following steps:
- query the utility's database
- clean the data and prepare features
- train several machine learning models over a range of parameters
- select the best-performing model
- On a monthly or daily basis, the user can make a prediction:
- query the utility's database (to get recent records)
- clean the data and prepare features
- apply the best-performing model parameters from the training phase
- Dashboard
- Plotly/Dash app accessible via a web browser (127.0.0.1:8050)
- Reads the prediction and displays interactive analytics
Accounts:
- An account to access the utility database
- Google Maps
- MapBox
Software requirements:
- PostgreSQL
- Flask
- Java 64-bit JDK, version 11 (required by Python h2o package)
- Plotly/Dash
- Python 3.4 or later; with the following packages:
- pandas
- numpy
- scipy
- math
- datetime
- requests
- psycopg2 (PostgreSQL interface)
- sklearn (for ML utilities; h2o ML models are favored over sklearn)
- imblearn (for SMOTE oversampling)
- h2o (for ML models)
- shutil (for copying files)
- argparse
- os
- flask
- plotly
- dash
Code is stored in the following directory structure:
- CutoffPredictor.py
- This is the back-end app
- CPdashboard.py
- This is the dashboard app
- backend/
- back-end functions
- config/
- template for input config file
- dashboard/
- dashboard functions
- Documentation/
- documentation files
The user must supply CutoffPredictor with a top-level directory for storing data, which we'll call DATA_DIR
. CutoffPredictor expects the following data directories to exist under DATA_DIR
:
- data_tables/
- tables queried from the utility database are stored here as .csv files
- data_tables_clean/
- cleaned versions of the database tables
- feature_tables.train/
- tables of features computed for the training/testing period
- predictions.train/
- tables of predictions and probabilities for the training/testing period
- saved_models/
- best-performing models saved here as json files
- model_perf/
- model performance statistics
- feature_tables.pred/
- tables of features computed for the prediction period
- predictions.pred/
- tables of predictions and probabilities for the prediction period
- this is a SQL database (CutoffPredictor uses PostgreSQL)
- this can be derived from the template under config/
-
Back end
python CutoffPredictor.py config_file >& log_file
where
- config_file = input config file, derived from the template.
- log_file = log file to store progress messages
-
Dashboard
python CPdashboard.py config_file >& log_file
where
- config_file = input config file, derived from the template
- log_file = log file to store progress messages
Both the back end and the dashboard use the same config file.
- Prepare model inputs (this can be done in a single step, with a single config file).
- Stages (
[STAGES]
section of config file):- Query database (
DOWNLOAD
=TRUE
) - Prepare/clean data (
PREP_DATA
=TRUE
) - Prepare features (
PREP_FEATURES
=TRUE
); this will prepare features for a range of values of window lengthnSamples
, controlled byN_SAMPLE_LIST
. These feature tables will be saved as csv files in thefeature_tables.train
subdirectory underDATA_DIR
. - All other options in the
STAGES
section should be set toFALSE
.
- Query database (
- Training options (
[TRAINING]
section of config file):N_SAMPLE_LIST
: list of window lengths (nSamples
) to consider
- Train models (this must be done separately for each desired set of model features)
- Stages (
[STAGES]
section of config file):- Train models (
TRAIN_MODELS
=TRUE
); this will do a search over all values ofnSamples
and across all specified model types to find the best-performing model andnSamples
to use for predictions. In addition, forrandom_forest
, the optimal value ofmax_depth
will be found. - All other options in the
STAGES
section should be set toFALSE
.
- Train models (
- Training options (
[TRAINING]
section of config file):REF_DATE
: indicates the final date of the training period; all records prior to and including this date will be used in training the models.N_SAMPLE_LIST
: list of window lengths (nSamples
) to considerMODEL_TYPES
: list of model types to explore; currently onlylogistic_regression
andrandom_forest
are supported.MAX_DEPTH_LIST
: list of values ofmax_depth
(used inrandom_forest
model) to explore; the minimum value should generally be set to 3 and the maximum should be somewhere between 5 and 20.FEATURES_CUT_PRIOR
: indicates whether to include among the feature set a boolean flag signifying whether a customer has had a prior cutoff; this is only possible for utilities that have recorded such information (not all do this); valid values are'no_cut_prior'
and'with_cut_prior'
.FEATURES_METADATA
: indicates whether to include among the feature set the three customer metadata variables'cust_type_code'
,'municipality'
, and'meter_size'
; valid values are'no_meta'
and'with_meta'
.FEATURES_ANOM
: indicates which volume anomaly metric to include among the feature set; valid values are'anom'
(use the simple anomaly feature'f_anom3_vol'
),'manom'
(use the monthly anomaly feature'f_manom3_vol'
), and'none'
.
This stage can be performed separately for each reference date and feature set over which models were trained in part A.
- Stages (
[STAGES]
section of config file):- Make prediction (
PREDICT
=TRUE
); this will use the best model/nSamples combination found in part A.2. for the given feature set - All other options in the
STAGES
section should be set toFALSE
.
- Make prediction (
- Training options (
[TRAINING]
section of config file):REF_DATE
: indicates the final date of the training period; the model and value ofnSamples
used will be those from stage A.2. for the best-performing model for this reference date and the given feature set.
- Prediction options (
[PREDICTION]
section of config file):REF_DATE
: indicates the 'current' date (normally this would be the actual current date, but it can be set to a date in the past to compare previous predictions to actual outcomes); predictions will be made based on values of metrics computed from thenSamples
months prior toREF_DATE
, wherenSamples
is the best-performing value found in stage A.2. for the given feature set.FEATURES_CUT_PRIOR
: this should be set to the value used in part A.2.FEATURES_METADATA
: this should be set to the value used in part A.2.FEATURES_ANOM
: this should be set to the value used in part A.2.
This stage can be performed separately for each reference date and feature set over which models were trained in part A and for which a prediction was made in part B.
- Stages (
[STAGES]
section of config file):- All options in the
STAGES
section will be ignored
- All options in the
- Prediction options (
[PREDICTION]
section of config file):REF_DATE
: indicates the 'current' date (normally this would be the actual current date, but it can be set to a date in the past to compare previous predictions to actual outcomes); predictions must have been made in stage B for this date and feature set.FEATURES_CUT_PRIOR
: this should be set to the value used in part B.FEATURES_METADATA
: this should be set to the value used in part B.FEATURES_ANOM
: this should be set to the value used in part B.