R wrapper for nowcast_lstm Python library. Long short-term memory neural networks for economic nowcasting.

nowcastLSTM

New in v0.2.2: ability to get uncertainty intervals for predictions and predictions on synthetic vintages.

New in v0.2.0: ability to get feature contributions to the model and to perform automatic hyperparameter tuning and variable selection, so these no longer need to be written outside of the library.

R wrapper for nowcast_lstm Python library. MATLAB and Julia wrappers also exist. Long short-term memory neural networks for economic nowcasting. More background in this paper in the Journal of Official Statistics.

Installation and set up

Installing the library: Install devtools with install.packages("devtools"). Then, from R, run devtools::install_github("dhopp1/nowcastLSTM"). If you get errors about packages being built on different versions of R, run Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true"), then run the install command again.

Note on updating the library: This R wrapper is not versioned. When a new version of the library is released, update the Python library by running pip install nowcast-lstm==0.2.3 (substituting the latest version number for 0.2.3) from the command line, then rerun devtools::install_github("dhopp1/nowcastLSTM") from R. This should give you access to the latest functionality in R.

Installing Python: If you already have Python installed on your system, simply follow the install instructions from the nowcast_lstm Python library and point initialize_session to the path where your Python is installed later on.

If you don't have Python installed on your system, run the following commands in R once you've run devtools::install_github("dhopp1/nowcastLSTM"):

library(reticulate)
install_miniconda(path = miniconda_path(), update = TRUE, force = FALSE)
py_install(conda=miniconda_path(), "dill numpy pandas pmdarima torch nowcast-lstm", pip=TRUE)

Example: nowcastLSTM_example.zip contains an R Markdown file with a dataset and more detailed example of usage in R.

Set up

Once all Python libraries are installed, run the initialize_session function in R each time you use the library.

library(nowcastLSTM)

# this function should be run at the beginning of every session. Python path can be left empty to use the system default
initialize_session(python_path = "path_to/python")

# if you installed Python via reticulate, use this. You may get a warning about requesting one path and getting another, but it should work regardless.
initialize_session(python_path = miniconda_path())

# use this to set Python location permanently
Sys.setenv(RETICULATE_PYTHON = "path_to/python")

Background

LSTM neural networks have been used for nowcasting before, combining the strengths of artificial neural networks with a temporal aspect. However, their use in nowcasting economic indicators remains limited, no doubt in part due to the difficulty of obtaining results in existing deep learning frameworks. This library seeks to streamline the process of obtaining results in the hopes of expanding the domains to which LSTM can be applied.

While neural networks are flexible and this framework may be able to get sensible results on levels, the model architecture was developed to nowcast growth rates of economic indicators. As such, training inputs should ideally be stationary and seasonally adjusted.

Further explanation of the background problem can be found in this UNCTAD research paper. Further explanation and results can be found in this UNCTAD research paper.

Quick usage

Given data = a dataframe with a date column + monthly data + a quarterly target series to run the model on, usage is as follows:

library(nowcastLSTM)
initialize_session()

# this command will instantiate and train an LSTM network
# due to quirks with using Python from R, the python_model_name argument should be set to the same name used for the R object it is assigned to.
model <- LSTM(data, "target_col_name", n_timesteps=12, python_model_name = "model") # default parameters with 12 timestep history
#model <- LSTM(data, "target_col_name", n_timesteps=12, n_models=10, seeds=c(1:10), python_model_name = "model") # For reproducibility on a single machine/system, give a list of manual seeds as long as the n_models parameter. Reproducibility across machines is not guaranteed.


predict(model, data) # predictions on the training set

# predicting on a testset, which is the same dataframe as the training data + newer data
# this will give predictions for all dates, but only predictions after the training data ends should be considered for testing
predict(model, test_data)

# to gauge performance on artificial data vintages
ragged_preds(model, pub_lags, lag, test_data)

# save a trained model
# python_model_name should be the same value used when the model was initially trained
save_lstm(model, "trained_model.pkl", python_model_name = "model")

# load a previously trained model
# due to quirks with using Python from R, the python_model_name argument should be set to the same name used for the R object it is assigned to.
trained_model <- load_lstm("trained_model.pkl", python_model_name = "trained_model")

Model selection

To ease variable and hyperparameter selection, the library can carry out this process automatically. See the example file or run ? on the functions for more information.

# case where given hyperparameters, want to select which variables go into the model
selected_variables <- variable_selection(data, "target_col_name", n_timesteps=12) # default parameters with 12 timestep history

# case where given variables, want to select hyperparameters
performance <- hyperparameter_tuning(data, "target_col_name", n_timesteps=12, n_hidden_grid=c(10,20))

# case where want to select both variables and hyperparameters for the model
performance <- select_model(data, "target_col_name", n_timesteps=12, n_hidden_grid=c(10,20))

Prediction uncertainty

Produce estimates along with lower and upper bounds of an uncertainty interval. See the example file or run ? on the functions for more information.

interval_preds <- interval_predict(
  model,
  test_data,
  interval = 0.95
)

ragged_interval_preds <- ragged_interval_predict(
  model, 
  pub_lags, 
  lag = 2, 
  data = test_data, 
  interval = 0.95
)

LSTM parameters

data: dataframe of the data to train the model on. Should contain a target column. Any non-numeric columns will be dropped. It should be in the most frequent period of the data. E.g. if I have three monthly variables, two quarterly variables, and a quarterly series, the rows of the dataframe should be months, with the quarterly values appearing every three months (whether Q1 = Jan 1 or Mar 1 depends on the series, but generally the quarterly value should come at the end of the quarter, i.e. Mar 1), with NAs or 0s in between. The same logic applies for yearly variables.
target_variable: a string, the name of the target column in the dataframe.
n_timesteps: an int, corresponding to the "memory" of the network, i.e. the target value depends on the x past values of the independent variables. For example, if the data is monthly, n_timesteps=12 means that the estimated target value is based on the previous year's worth of data, 24 on the previous two years' worth, etc. This is a hyperparameter that can be tuned.
fill_na_func: a function used to replace missing values. Options are c("mean", "median", "ARMA").
fill_ragged_edges_func: a function used to replace missing values at the end of series. Leave blank to use the same function as fill_na_func, pass "ARMA" to use ARMA estimation using pmdarima.arima.auto_arima. Options are c("mean", "median", "ARMA").
n_models: int of the number of networks to train and predict on. Because neural networks are inherently stochastic, it can be useful to train multiple networks with the same hyperparameters and take the average of their outputs as the model's prediction, to smooth the output.
train_episodes: int of the number of training episodes/epochs. A short discussion of the topic can be found here.
batch_size: int of the number of observations per batch. Discussed here.
decay: float of the rate of decay of the learning rate. Also discussed here. Set to 0 for no decay.
n_hidden: int of the number of hidden states in the LSTM network. Discussed here.
n_layers: int of the number of LSTM layers to include in the network. Also discussed here.
dropout: float of the proportion of layers to drop in between LSTM layers. Discussed here.
criterion: PyTorch loss function. Discussed here, list of available options in PyTorch here. Pass as a string, e.g. one of c("torch.nn.L1Loss()", "torch.nn.MSELoss()"), etc.
optimizer: PyTorch optimizer. Discussed here, list of available options in PyTorch here. Pass as a string, e.g. "torch.optim.Adam".
optimizer_parameters: named list. Parameters for a particular optimizer, including learning rate. Information here. For instance, to change learning rate (default 1e-2), pass list("lr"=1e-2), or weight_decay for L2 regularization, pass list("lr"=1e-2, "weight_decay"=0.001). Learning rate discussed here.
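
As a hedged illustration of the layout described for the data parameter, the sketch below builds a monthly data frame in base R in which a quarterly target appears only in the last month of each quarter and is NA in between. The column names x1, x2 and target and all values are made up for illustration; only the date column and the monthly-rows-with-gaps structure come from the description above.

```r
# Hypothetical example data: two monthly indicators and one quarterly target.
set.seed(1)
dates <- seq(as.Date("2020-01-01"), as.Date("2020-12-01"), by = "month")

data <- data.frame(
  date = dates,
  x1 = rnorm(length(dates)),  # monthly indicator (made-up values)
  x2 = rnorm(length(dates)),  # monthly indicator (made-up values)
  target = NA_real_           # quarterly series, filled in below
)

# Place each quarterly value in the last month of its quarter (Mar, Jun, Sep, Dec).
quarter_end <- as.integer(format(data$date, "%m")) %% 3 == 0
data$target[quarter_end] <- rnorm(sum(quarter_end))

print(data)
```

A data frame shaped like this, with a longer history than one year, is the kind of input the data argument describes.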

LSTM outputs

Assuming a model has been instantiated and trained with model = LSTM(...), the following functions are available. Run help(function) on any of them to find out more about them and their parameters. Other information, like training loss, is available in the trained model object, accessed via $, e.g. model$train_loss:

predict: to generate predictions on new data
save_lstm: to save a trained model to disk
load_lstm: to load a saved model from disk
ragged_preds(model, pub_lags, lag, new_data, start_date, end_date): adds artificial missing data, then returns a dataframe with date, actuals, and predictions. This is especially useful as a testing mechanism, to generate datasets showing how a trained model would have performed at different synthetic vintages or periods of time in the past. pub_lags should be a list of ints (in the same order as the columns of the original data) of length n_features (i.e. excluding the target variable) giving the normal publication lag of each variable. lag is an int of how many periods back we want to simulate being, interpretable as the last period relative to the target period. E.g. if we are nowcasting June, lag = -1 will simulate being in May, where May data is published for variables with a publication lag of 0. Values that wouldn't have been available yet, according to the publication lag of the variable + the lag parameter, are set to missing and then filled with the method specified in the fill_ragged_edges_func parameter at model instantiation.
gen_news(model, target_period, old_data, new_data): Generates news from one data release to another, adding an element of causal inference to the network. Works by holding out new data column by column, recording differences between this prediction and the prediction on full data, and registering this difference as the new data's contribution to the prediction. Contributions are then scaled to equal the actual observed difference in prediction in the aggregate between the old dataset and the new dataset.
model$feature_contribution(): Generates a dataframe showing the relative feature importance of variables in the model using the permutation feature contribution method via RMSE on the train set.
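
To make the interaction of pub_lags and lag in ragged_preds more concrete, here is a minimal base R sketch of the masking step only. mask_vintage is a hypothetical helper written for this illustration and is not part of nowcastLSTM; it assumes that the last published row for a variable is the target row + lag - that variable's publication lag, which is one reading of the description above. The library's own implementation may differ, and the subsequent filling via fill_ragged_edges_func is not shown.

```r
# Hypothetical helper, not part of nowcastLSTM: mask observations that would
# not yet have been published at a simulated vintage.
mask_vintage <- function(data, feature_cols, pub_lags, lag, target_row) {
  stopifnot(length(feature_cols) == length(pub_lags))
  masked <- data
  for (j in seq_along(feature_cols)) {
    # Last row assumed to be published: target period + lag - publication lag.
    last_available <- target_row + lag - pub_lags[j]
    if (last_available < nrow(masked)) {
      hidden_rows <- seq(max(last_available + 1, 1), nrow(masked))
      masked[hidden_rows, feature_cols[j]] <- NA
    }
  }
  masked
}

# Made-up monthly data for January to June.
data <- data.frame(
  date = seq(as.Date("2021-01-01"), by = "month", length.out = 6),
  x1 = 1:6,   # assumed publication lag of 0 months
  x2 = 11:16  # assumed publication lag of 1 month
)

# Nowcasting June (row 6) as if standing in May (lag = -1).
vintage <- mask_vintage(data, c("x1", "x2"), pub_lags = c(0, 1), lag = -1, target_row = 6)
print(vintage)
```

Under these assumptions x1 keeps its values through May and x2 through April, which matches the June/May example given for lag = -1.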

nowcastlstm's Issues

Single value model predictions

I've been struggling to get the model working on my own sample data, which is a selection of data from World Bank GEM (see attached). Whatever the parameters I choose for the training data and the model, I only get a single value for the entire series. What am I doing wrong?

nowcast_GEM.zip

Failure to update to v0.2.0

Hi Daniel,

I tried updating the nowcastLSTM package to the latest version using the install_github command. However, packageVersion("nowcastLSTM") gives
[1] ‘0.0.0.0’. As a consequence, the new functions like hyperparameter tuning and variable selection are not found.

Specifying the release with install_github("dhopp1/nowcastLSTM@v0.2.0") gives the following error
Error in utils::download.file(url, path, method = method, quiet = quiet, : cannot open URL 'https://api.github.com/repos/dhopp1/nowcastLSTM/tarball/v0.2.0'

Updating the Python package through py_install(conda=miniconda_path(), "nowcast-lstm", pip=TRUE) gives Requirement already satisfied: nowcast-lstm in c:\users\julian.slotman\appdata\local\r-miniconda\lib\site-packages (0.2.1).

What to do?

Variable_selection errors

Hi Daniel,

I keep getting the following errors when running the variable_selection, regardless of the configuration of the function parameters (init_test_size, n_fold, etc).
C:\Users\...\AppData\Local\r-miniconda\lib\site-packages\nowcast_lstm\data_setup.py:203: RuntimeWarning: Mean of empty slice
rawdata[col] = rawdata[col].fillna(fill_na_func(fill_na_df[col]))
multivariate stage: 0 / 179 columns
...
multivariate stage: 17 / 179 columns
Error: IndexError: too many indices for array: array is 1-dimensional, but 3 were indexed

I used to be able to run the function on smaller country data with the following parameters: init_test_size = 0.3, n_fold = 2, initial_ordering = "feature_contribution", pub_lags = -1. Now I'm trying to change the workflow to wider panel data, as you suggested. Do you know what I might be doing wrong?

Error in `[.data.frame`(preds, , date_col) : undefined columns selected

After running through the test model, I've been trying out the LSTM package on my own data. I noticed that the predict function is sensitive to the name of the date column. Is that possible? The date column in my own data was capitalized so it wasn't recognized, resulting in the error above. Changing the column name to 'date' seems to do the trick.
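
A minimal workaround sketch in base R, assuming the only problem is the capitalization of the date column's name: rename that column to lowercase date before passing the data to the library. The data frame my_data and its columns are made up for illustration.

```r
# Made-up data whose date column is capitalized.
my_data <- data.frame(
  Date = seq(as.Date("2021-01-01"), by = "month", length.out = 3),
  gdp_growth = c(0.5, NA, 0.7)
)

# Rename any column called "date", in any capitalization, to lowercase "date".
is_date_col <- tolower(names(my_data)) == "date"
names(my_data)[is_date_col] <- "date"

print(names(my_data))
```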

fill_na_func does not work with ARMA

I tried running the fill_na_func with ARMA but I get the following error

 Error in py_run_string_impl(code, local, convert) : 
  TypeError: 'str' object is not callable

ARMA works fine for fill_ragged_edge_func (albeit at a significantly higher run time) so the problem doesn't seem to be related to the call to Python.

Error in system2(command = python, args = shQuote(script), stdout = TRUE, : '"C:/Users/.../AppData/Local/r-miniconda"' not found

Hi Daniel,

I ran into some trouble trying to update the nowcastLSTM package and now I cannot get my code to work again. Specifically, the use_python(python = miniconda_path(), required = T) command gives the error in the title:

Error in system2(command = python, args = shQuote(script), stdout = TRUE,  : 
  '"C:/Users/jrslo/AppData/Local/r-miniconda"' not found

I tried updating R and all the relevant packages but I am afraid this only made matters worse. sessionInfo() shows that version 0.0.0.0000 of the nowcastLSTM package is loaded. I got the following warning WARNING: Rtools is required to build R packages, but is not currently installed. when trying to install the nowcastLSTM package through the install_github command.

I don't know if this is related but I changed the commands to load packages when trying to clean up my code. I wanted to use the librarian package as follows: librarian::shelf(here, tidyverse, dhopp1/nowcastLSTM, beepr, furrr) but this may have had some unintended consequences.

Any idea what I can do?

Errors with 'unknown package' from Github

Hi Daniel,
I have tried to install the library following this:
Installing the library: Install devtools with install.packages("devtools"). Then, from R, run: devtools::install_github("dhopp1/nowcastLSTM"). If you get errors about packages being built on different versions of R, try running Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true"), then run the install command again

However, I get the following error (for both LSTM and DFM)

devtools::install_github("dhopp1/nowcastLSTM")
Error: Failed to install 'unknown package' from GitHub:
JSON: EXPECTED value GOT <
devtools::install_github("dhopp1/nowcastDFM")
Error: Failed to install 'unknown package' from GitHub:
JSON: EXPECTED value GOT <

How can I overcome this? Thanks in advance!

Error "IndexError: too many indices for array"

Not exactly an issue but more sharing experiences. The following error message in R

C:\Users...\AppData\Local\r-miniconda\lib\site-packages\nowcast_lstm\data_setup.py:203: RuntimeWarning: Mean of empty slice
rawdata[col] = rawdata[col].fillna(fill_na_func(fill_na_df[col]))
Error in py_run_string_impl(code, local, convert) :
IndexError: too many indices for array: array is 1-dimensional, but 3 were indexed

relates to empty columns in your training data, which may happen if you run the model on many different countries with varying data availability. Eliminating those columns with the line %>% select(which(colMeans(is.na(.)) < 1)) following the definition of the sample data should help to resolve this issue.
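
For readers not using dplyr, the same idea can be sketched in base R. The data frame train and its columns are made up for illustration; the filter mirrors the colMeans(is.na(.)) < 1 condition quoted above and drops only columns that are entirely missing.

```r
# Made-up training data with one entirely empty column.
train <- data.frame(
  date = seq(as.Date("2021-01-01"), by = "month", length.out = 4),
  x1 = c(1.2, NA, 0.8, 1.1),
  x2 = NA_real_,               # no observations at all
  target = c(NA, NA, 0.4, NA)
)

# Keep only columns in which fewer than 100% of the values are missing.
keep <- colMeans(is.na(train)) < 1
train_clean <- train[, keep, drop = FALSE]

print(names(train_clean))
```

Columns with some missing values, such as x1 and the quarterly target here, are kept; only the fully empty x2 is removed.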

PIP error messages when installing packages

Having never used Python before, I struggled a little to get the packages installed. First, PIP produced errors relating to the path length. I had to follow the steps in https://www.howtogeek.com/266621/how-to-make-windows-10-accept-file-paths-over-260-characters/ to proceed. Then, PIP produced warnings. This article (https://stackoverflow.com/questions/49966547/pip-10-0-1-warning-consider-adding-this-directory-to-path-or) helped me overcome that issue.

In RStudio, the initialize_session() command initially produced errors. Turns out I had to update the Rcpp package (https://stackoverflow.com/questions/68416435/rcpp-package-doesnt-include-rcpp-precious-remove). Now I hope it finally works!

Installation errors

Hi Daniel,

Sorry to come back to this issue again but it keeps on popping up. I tried installing the nowcastLSTM package in a clean environment but the installation failed and I got the following errors:
*** arch - i386
Error: package or namespace load failed for 'dplyr' in library.dynam(lib, package, package.lib):
DLL 'rlang' not found: maybe not installed for this architecture?
Error : package 'dplyr' could not be loaded
Error: loading failed
Execution halted
Tidyverse and reticulate are already installed and loaded in the new environment. What can I do to overcome this?

Error in py_run_string_impl(code, local, convert) : AttributeError: 'DataFrame' object has no attribute 'NA'

My nowcasting script keeps failing since yesterday, producing the following error when I call the LSTM function

Error in py_run_string_impl(code, local, convert) : AttributeError: 'DataFrame' object has no attribute 'NA'

I suspect it may have something to do with the reticulate package but I don't know how to fix it. I call Python and your package using the following two lines of code

use_python(python = miniconda_path(), required = T)
initialize_session(python_path = miniconda_path())

Any idea what's going on here?

Variable selection does not improve RMSE

I've experimented with the variable selection function to reduce the feature space and keep only those features with (hopefully) the most explanatory power. However, I noticed the number of features that is selected by the function is very low and the selected features don't seem to be the most informative. This is confirmed by a more thorough test I ran to compare RMSE on a test set after training the model many times and for multiple countries on 1. only the suggested features or 2. the full set of features. RMSE varied of course but was about twice as high after variable selection. I also don't get anywhere close to the performance calculated by the variable_selection function, even when calculating RMSE on the train data. In fact, RMSE What am I doing wrong in my setup?

Specifically, I take the following steps:

1. Create train data with cutoff that leaves a few years (~20%) of test data;
2. Feature engineering on train data (imputation, remove near-zero variation and highly correlated features, centering and scaling);
3. Find tune variables using variable_selection with identical parameters* as the model (n_timesteps=12, train_episodes = 5, batch_size = 64, decay = 0.995, n_hidden = 5, n_layers = 2, dropout = 0.23, optimizer = "torch.optim.RMSprop", criterion = "torch.nn.MSELoss()") and initial_ordering = "feature_contribution";
4. Create copy of traindata using only date, target variable and tune variables;
5. Train model on new traindata;
6. LSTM predict on testdata (only_actuals_obs = TRUE);
7. Calculate RMSE on observations in testdata after the train cutoff date.

* The hyperparameters are tuned mostly for speed, not accuracy.
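
The final step, RMSE on test observations after the train cutoff, can be sketched in base R as follows. This assumes predictions come back as a data frame with date, actuals and predictions columns, as described for ragged_preds; the rmse helper, the cutoff date and all values are made up for illustration.

```r
# Hypothetical RMSE helper; rows with a missing actual or prediction are ignored.
rmse <- function(actuals, predictions) {
  ok <- !is.na(actuals) & !is.na(predictions)
  sqrt(mean((actuals[ok] - predictions[ok])^2))
}

# Made-up predictions in the shape date / actuals / predictions.
preds <- data.frame(
  date = as.Date(c("2020-09-01", "2020-12-01", "2021-03-01", "2021-06-01")),
  actuals = c(0.5, 0.2, 0.4, 0.1),
  predictions = c(0.4, 0.3, 0.1, 0.5)
)

train_cutoff <- as.Date("2020-12-01")  # assumed last date in the training data

# Only score observations after the training cutoff.
test_rows <- preds$date > train_cutoff
test_rmse <- rmse(preds$actuals[test_rows], preds$predictions[test_rows])
print(test_rmse)
```

Scoring only rows after the cutoff keeps in-sample predictions from flattering the test RMSE.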

Error in py_run_string_impl(code, local, convert) : NameError: name 'pd' is not defined

NameError: name 'feature_contribution' is not defined

I'm experimenting with the variable_selection function but I get the following error:
NameError: name 'feature_contribution' is not defined
I've tried "univariate" instead for initial_ordering but I get a similar error. What can I do to solve this?
