CutoffPredictor

Tool for water utilities to monitor and predict customers' risk of service interruption

For details, please read the User Guide

Overview

CutoffPredictor consists of:

  1. Back end
  • A machine learning model is trained periodically, via the following steps:
    • query the utility's database
    • clean the data and prepare features
    • train several machine learning models over a range of parameters
    • select the best-performing model
  • On a monthly or daily basis, the user can make a prediction:
    • query the utility's database (to get recent records)
    • clean the data and prepare features
    • apply the best-performing model parameters from the training phase
  2. Dashboard
  • Plotly/Dash app accessible via a web browser (127.0.0.1:8050)
  • Reads the prediction and displays interactive analytics

Requirements

Accounts:

  • An account to access the utility database
  • Google Maps
  • MapBox

Software requirements:

  • PostgreSQL
  • Flask
  • Java 64-bit JDK, version 11 (required by Python h2o package)
  • Plotly/Dash
  • Python 3.4 or later, with the following third-party packages:
    • pandas
    • numpy
    • scipy
    • requests
    • psycopg2 (PostgreSQL interface)
    • scikit-learn, imported as sklearn (for ML utilities; h2o ML models are favored over sklearn)
    • imbalanced-learn, imported as imblearn (for SMOTE oversampling)
    • h2o (for ML models)
    • flask
    • plotly
    • dash
  • The following Python standard-library modules are also used and need no separate installation:
    • math
    • datetime
    • shutil (for copying files)
    • argparse
    • os

Code

Code is stored in the following directory structure:

  • CutoffPredictor.py
    • This is the back-end app
  • CPdashboard.py
    • This is the dashboard app
  • backend/
    • back-end functions
  • config/
    • template for input config file
  • dashboard/
    • dashboard functions
  • Documentation/
    • documentation files

Data

The user must supply CutoffPredictor with a top-level directory for storing data, which we'll call DATA_DIR. CutoffPredictor expects the following data directories to exist under DATA_DIR:

  • data_tables/
    • tables queried from the utility database are stored here as .csv files
  • data_tables_clean/
    • cleaned versions of the database tables
  • feature_tables.train/
    • tables of features computed for the training/testing period
  • predictions.train/
    • tables of predictions and probabilities for the training/testing period
  • saved_models/
    • best-performing models saved here as json files
  • model_perf/
    • model performance statistics
  • feature_tables.pred/
    • tables of features computed for the prediction period
  • predictions.pred/
    • tables of predictions and probabilities for the prediction period

In addition to the data directories, CutoffPredictor needs:

  1. Utility database
  • this is a SQL database (CutoffPredictor uses PostgreSQL)
  2. Configuration file
  • this can be derived from the template under config/
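
The data directories listed above can be created ahead of time. The following is a minimal sketch, not part of CutoffPredictor itself: the helper name make_data_dirs is hypothetical, and only the subdirectory names are taken from this README.

```python
from pathlib import Path

# Subdirectory names as listed in this README.
DATA_SUBDIRS = [
    "data_tables",
    "data_tables_clean",
    "feature_tables.train",
    "predictions.train",
    "saved_models",
    "model_perf",
    "feature_tables.pred",
    "predictions.pred",
]


def make_data_dirs(data_dir):
    """Create any missing subdirectories under data_dir and return their paths.

    Hypothetical helper for illustration; existing directories are left as-is.
    """
    root = Path(data_dir)
    paths = []
    for name in DATA_SUBDIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths
```

Calling make_data_dirs("/path/to/DATA_DIR") more than once is harmless, since directories that already exist are not modified.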

Usage

  1. Back end

     python CutoffPredictor.py config_file >& log_file
    

    where

  • config_file = input config file, derived from the template.
  • log_file = log file to store progress messages
  2. Dashboard

     python CPdashboard.py config_file >& log_file
    

    where

  • config_file = input config file, derived from the template
  • log_file = log file to store progress messages

Both the back end and the dashboard use the same config file.
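
As an illustration of how stage flags in such a config file might be read, here is a minimal sketch. It assumes the config file uses an INI layout that Python's configparser can parse; this README names the [STAGES] section and its options, but the exact file format, the EXAMPLE_CONFIG contents, and the enabled_stages helper are assumptions. See the template under config/ for the real layout.

```python
import configparser

# Hypothetical config contents. Only the section and option names come from
# this README; the INI layout itself is an assumption.
EXAMPLE_CONFIG = """
[STAGES]
DOWNLOAD = TRUE
PREP_DATA = TRUE
PREP_FEATURES = TRUE
TRAIN_MODELS = FALSE
PREDICT = FALSE
"""


def enabled_stages(config_text):
    """Return the names of the stages whose flag parses as true."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names in upper case as written
    parser.read_string(config_text)
    stages = parser["STAGES"]
    return [name for name in stages if stages.getboolean(name)]


print(enabled_stages(EXAMPLE_CONFIG))
```

configparser accepts TRUE and FALSE in any letter case as boolean values, which is why getboolean works on the flags as written here.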

Recommended Process Flow

A. Update/Retrain Models (monthly or less frequently)

  1. Prepare model inputs (this can be done in a single step, with a single config file).
  • Stages ([STAGES] section of config file):
    • Query database (DOWNLOAD = TRUE)
    • Prepare/clean data (PREP_DATA = TRUE)
    • Prepare features (PREP_FEATURES = TRUE). This prepares features for each value of the window length nSamples listed in N_SAMPLE_LIST and saves the resulting feature tables as CSV files in the feature_tables.train subdirectory under DATA_DIR.
    • All other options in the STAGES section should be set to FALSE.
  • Training options ([TRAINING] section of config file):
    • N_SAMPLE_LIST: list of window lengths (nSamples) to consider
  2. Train models (this must be done separately for each desired set of model features)
  • Stages ([STAGES] section of config file):
    • Train models (TRAIN_MODELS = TRUE). This searches over all values of nSamples and all specified model types to find the best-performing model and nSamples to use for predictions. For random_forest, the search also finds the best value of max_depth.
    • All other options in the STAGES section should be set to FALSE.
  • Training options ([TRAINING] section of config file):
    • REF_DATE: indicates the final date of the training period; all records prior to and including this date will be used in training the models.
    • N_SAMPLE_LIST: list of window lengths (nSamples) to consider
    • MODEL_TYPES: list of model types to explore; currently only logistic_regression and random_forest are supported.
    • MAX_DEPTH_LIST: list of values of max_depth (used in random_forest model) to explore; the minimum value should generally be set to 3 and the maximum should be somewhere between 5 and 20.
    • FEATURES_CUT_PRIOR: indicates whether to include among the feature set a boolean flag signifying whether a customer has had a prior cutoff; this is only possible for utilities that have recorded such information (not all do this); valid values are 'no_cut_prior' and 'with_cut_prior'.
    • FEATURES_METADATA: indicates whether to include among the feature set the three customer metadata variables 'cust_type_code', 'municipality', and 'meter_size'; valid values are 'no_meta' and 'with_meta'.
    • FEATURES_ANOM: indicates which volume anomaly metric to include among the feature set; valid values are 'anom' (use the simple anomaly feature 'f_anom3_vol'), 'manom' (use the monthly anomaly feature 'f_manom3_vol'), and 'none'.
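
The search described in the training step can be pictured as a loop over every combination of model type, nSamples, and (for random_forest only) max_depth. The sketch below is not CutoffPredictor's actual code: the function names are hypothetical, and evaluate stands in for whatever trains one model and returns a score where larger is better.

```python
from itertools import product


def candidate_settings(n_sample_list, model_types, max_depth_list):
    """Yield (model_type, n_samples, max_depth) combinations to evaluate.

    max_depth applies only to random_forest; it is None for other model types.
    """
    for model_type, n_samples in product(model_types, n_sample_list):
        if model_type == "random_forest":
            for max_depth in max_depth_list:
                yield model_type, n_samples, max_depth
        else:
            yield model_type, n_samples, None


def select_best(n_sample_list, model_types, max_depth_list, evaluate):
    """Return (score, setting) for the highest-scoring combination.

    `evaluate` is a hypothetical callable taking (model_type, n_samples,
    max_depth) and returning a score where larger is better.
    """
    best = None
    for setting in candidate_settings(n_sample_list, model_types, max_depth_list):
        score = evaluate(*setting)
        if best is None or score > best[0]:
            best = (score, setting)
    return best
```

With two window lengths, both supported model types, and three depths, this evaluates 2 logistic_regression candidates and 6 random_forest candidates.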

B. Model Predictions (monthly to daily)

This stage can be performed separately for each reference date and feature set over which models were trained in part A.

  • Stages ([STAGES] section of config file):
    • Make prediction (PREDICT = TRUE); this will use the best model/nSamples combination found in part A.2. for the given feature set
    • All other options in the STAGES section should be set to FALSE.
  • Training options ([TRAINING] section of config file):
    • REF_DATE: indicates the final date of the training period; the model and value of nSamples used will be those from stage A.2. for the best-performing model for this reference date and the given feature set.
  • Prediction options ([PREDICTION] section of config file):
    • REF_DATE: indicates the 'current' date (normally this would be the actual current date, but it can be set to a date in the past to compare previous predictions to actual outcomes); predictions will be made based on values of metrics computed from the nSamples months prior to REF_DATE, where nSamples is the best-performing value found in stage A.2. for the given feature set.
    • FEATURES_CUT_PRIOR: this should be set to the value used in part A.2.
    • FEATURES_METADATA: this should be set to the value used in part A.2.
    • FEATURES_ANOM: this should be set to the value used in part A.2.
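
To make the prediction window concrete, the sketch below computes the start and end dates of the nSamples months prior to REF_DATE. It rests on assumptions this README does not state: that REF_DATE is written as YYYY-MM-DD, that months are counted as calendar months, and that the day is clamped at month end. The function names are hypothetical, and date.fromisoformat requires Python 3.7 or later.

```python
import calendar
from datetime import date


def months_before(ref_date, n_months):
    """Return the date n_months calendar months before ref_date.

    The day is clamped to the last day of the target month when needed,
    so one month before 31 March is the last day of February.
    """
    month_index = ref_date.year * 12 + (ref_date.month - 1) - n_months
    year, month0 = divmod(month_index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(ref_date.day, last_day))


def feature_window(ref_date_text, n_samples):
    """Return (start, end) of the window of n_samples months ending at REF_DATE."""
    end = date.fromisoformat(ref_date_text)  # assumes YYYY-MM-DD
    return months_before(end, n_samples), end
```

For example, feature_window("2019-01-15", 3) spans 15 October 2018 through 15 January 2019.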

C. Dashboard (any time)

This stage can be performed separately for each reference date and feature set over which models were trained in part A and for which a prediction was made in part B.

  • Stages ([STAGES] section of config file):
    • All options in the STAGES section will be ignored
  • Prediction options ([PREDICTION] section of config file):
    • REF_DATE: indicates the 'current' date (normally this would be the actual current date, but it can be set to a date in the past to compare previous predictions to actual outcomes); predictions must have been made in stage B for this date and feature set.
    • FEATURES_CUT_PRIOR: this should be set to the value used in part B.
    • FEATURES_METADATA: this should be set to the value used in part B.
    • FEATURES_ANOM: this should be set to the value used in part B.
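
Because the three feature-set options must stay the same from training through prediction to the dashboard, a small consistency check can catch mismatches before a run. The following is a sketch only: the valid values are those listed in part A, while the function name and the use of plain dicts for the option sets are assumptions.

```python
# Valid values as listed in part A of this README.
VALID_FEATURE_OPTIONS = {
    "FEATURES_CUT_PRIOR": {"no_cut_prior", "with_cut_prior"},
    "FEATURES_METADATA": {"no_meta", "with_meta"},
    "FEATURES_ANOM": {"anom", "manom", "none"},
}


def check_feature_options(earlier, later):
    """Return a list of problems found in two option dicts (empty if none).

    `earlier` holds the options used in the previous stage and `later` those
    for the current stage; both are hypothetical plain dicts of strings.
    """
    problems = []
    for key, valid in VALID_FEATURE_OPTIONS.items():
        for label, options in (("earlier", earlier), ("later", later)):
            value = options.get(key)
            if value not in valid:
                problems.append(
                    f"{label} stage: {key}={value!r} is not one of {sorted(valid)}"
                )
        if earlier.get(key) != later.get(key):
            problems.append(f"{key} differs between the two stages")
    return problems
```

An empty list means both option sets are valid and identical; each mismatch or invalid value adds one message.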
