carlomazzaferro / scikit-hts

Hierarchical Time Series Forecasting with a familiar API

License: MIT License

Python 98.25% Shell 0.11% Makefile 1.64%
time-series machine-learning hierarchical-data fbprophet statsmodels exponential-smoothing time-series-analysis scikit-learn

scikit-hts's Introduction

NOTE: Unfortunately, I no longer have time to dedicate to this project. Contributions are welcome.

scikit-hts

Hierarchical Time Series with a familiar API. This project grew out of not finding any good implementations of HTS online, and out of my work in the mobility space at Circ (acquired by Bird scooters).

My work on this is purely out of passion, so contributions are always welcomed. You can also buy me a coffee if you'd like:

ETH / BSC Address: 0xbF42b9c8F7B69D52b8b986AA4E0BAc6838Af6698

Overview

Building on the excellent work by Hyndman [1], we developed this package in order to provide a Python implementation of general hierarchical time series modeling.

Note

STATUS: alpha. Active development, but breaking changes may come.

Features

  • Supported and tested on python 3.6, python 3.7 and python 3.8
  • Implementation of Bottom-Up, Top-Down, Middle-Out, Forecast Proportions, Average Historic Proportions, Proportions of Historic Averages and OLS revision methods
  • Support for representations of hierarchical and grouped time series
  • Support for a variety of underlying forecasting models, including: SARIMAX, ARIMA, Prophet, Holt-Winters
  • Scikit-learn-like API
  • Geo events handling functionality for geospatial data, including visualisation capabilities
  • Static typing for a nice developer experience
  • Distributed training & Dask integration: perform training and prediction in parallel or in a cluster with Dask
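
To make the Bottom-Up method from the list above concrete, here is a minimal pure-Python sketch. It does not use scikit-hts: the hierarchy, the leaf forecasts, and the function name `bottom_up` are illustrative assumptions. Leaf-level forecasts are summed upwards so that every parent equals the sum of its children.

```python
from typing import Dict, List


def bottom_up(hierarchy: Dict[str, List[str]],
              leaf_forecasts: Dict[str, float],
              root: str = "total") -> Dict[str, float]:
    """Aggregate leaf forecasts up the tree (illustrative, not the scikit-hts code)."""
    revised: Dict[str, float] = dict(leaf_forecasts)
    # Post-order traversal with an explicit stack: children are summed before parents.
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        children = hierarchy.get(node, [])
        if not children:
            continue  # leaf: its value is already in `revised`
        if children_done:
            revised[node] = sum(revised[child] for child in children)
        else:
            stack.append((node, True))
            stack.extend((child, False) for child in children)
    return revised


# Hypothetical two-level hierarchy and one-step-ahead leaf forecasts.
hierarchy = {"total": ["a", "b"], "a": ["a_x", "a_y"], "b": ["b_x", "b_y"]}
leaves = {"a_x": 1.0, "a_y": 2.0, "b_x": 3.0, "b_y": 4.0}
coherent = bottom_up(hierarchy, leaves)
```

The result is coherent by construction, which is the defining property of Bottom-Up; the other methods in the list differ in how they distribute or combine forecasts across levels.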

Examples

You can find code usages here: https://github.com/carlomazzaferro/scikit-hts-examples

Roadmap

  • More flexible underlying modeling support
    • [P] AR, ARIMAX, VARMAX, etc
    • [P] Bring-Your-Own-Model
    • [P] Different parameters for each of the models
  • Decoupling reconciliation methods from forecast fitting
    • [W] Enable use of the reconciliation methods with pre-fitted models
P: Planned
W: WIP

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.


  1. Forecasting Principles and Practice. Rob J Hyndman and George Athanasopoulos. Monash University, Australia.

scikit-hts's People

Contributors

aakashparsi, carlomazzaferro, javierhuertay, mpette200, noahsa, ryanvolpi, vtoliveira, wilfreddesert

scikit-hts's Issues

Possible [BUG] - High memory usage causing colab notebooks to crash and local system swap usage.

Caveat that I'm new to using most of these libraries, so this could just be operator error. My expectation was that given that these examples were based on a Kaggle competition, they should generally be able to run on readily available free/cheap systems.

Describe the bug
I attempted to follow the M5 examples in the "examples" repo. I could not get the docker container to work properly locally, so I decided to follow the more manual process. On my local machine it takes hours to run and uses all available memory and significant swap. In Colab the notebook crashes in a matter of seconds due to using all available memory.

To Reproduce
Steps to reproduce the behavior:

On my local machine (2013 MacBook Pro, conda Python 3.7) I can get the code to run, but it took significantly more time than I expected.

With the following settings, this cell took 6 hours to run (copied from the M5 example notebook):
from hts import HTSRegressor
clf = HTSRegressor(model='prophet', revision_method='OLS')
model = clf.fit(df, hierarchy)

Switching to the "low memory" mode, it actually ran faster: only 2 hours. However, the very next cell, the forecast cell, has been running for over 12 hours and is only at 60%. Both training and prediction have taken over my system resources. I'm seeing several python3 processes, with the worst one using 20 GB of memory (only 8 GB is actually on board), so this is clearly using a lot of swap space, which I expect is responsible for slowing it down.
from hts import HTSRegressor
clf = HTSRegressor(model='prophet', revision_method='OLS', low_memory=True)
model = clf.fit(df, hierarchy)

Finally, attempting to run it in a Colab notebook caused the notebook to crash due to using up all available RAM in <30s.
To be clear, I'm referring to this same code:
from hts import HTSRegressor
clf = HTSRegressor(model='prophet', revision_method='OLS', low_memory=True)
model = clf.fit(df, hierarchy)

Expected behavior
A clear and concise description of what you expected to happen.

  • relatively quick run time, not crashing.

I'm trying to understand if this is expected behavior, user error, or a possible bug related to memory management or starting too many processes at once. Based on the fact that the example notebook was designed with the goal of demonstrating the efficacy of the package for solving a Kaggle competition, and the fact that several leading notebooks run in a minute or two, I thought it was unusual to experience multi-hour run times and crashes when using Colab. Did I just need different settings, or is this a bug?

Desktop (please complete the following information):

  • OS: [e.g. OS X Catalina] OSX Catalina,
  • scikit-hts version: [e.g. 0.2.1] packages from the "requirements.txt"
  • Python version: [e.g. 3.7.4] python 3.7 (anaconda),

Colab - default colab settings. Out of curiosity, I tried it with both TPU and GPU, but as expected, they made no difference.

Recursion Error

I cannot run more than two levels on my local machine using a Jupyter notebook.
hierarchy = {**total, **state_h} --> works
hierarchy = {**total, **state_h, **store_h} ---> fails with below error

print(sys.getrecursionlimit())
3000


RecursionError Traceback (most recent call last)
in
3 df = df.resample("W").sum()
4 clf = HTSRegressor(model='prophet', revision_method='OLS', n_jobs=12)
----> 5 model = clf.fit(df, hierarchy)
6
7 preds = model.predict(steps_ahead=1)

~\Anaconda3\lib\site-packages\hts\core\regressor.py in fit(self, df, nodes, tree, exogenous, root, distributor, disable_progressbar, show_warnings, **fit_kwargs)
194 """
195
--> 196 self.__init_hts(nodes=nodes, df=df, tree=tree, root=root, exogenous=exogenous)
197
198 nodes = make_iterable(self.nodes, prop=None)

~\Anaconda3\lib\site-packages\hts\core\regressor.py in __init_hts(self, nodes, df, tree, root, exogenous)
131 self.nodes = tree
132 else:
--> 133 self.nodes = HierarchyTree.from_nodes(nodes=nodes, df=df, exogenous=exogenous, root=root)
134 self.exogenous = exogenous
135 self.sum_mat = to_sum_mat(self.nodes)

~\Anaconda3\lib\site-packages\hts\hierarchy\__init__.py in from_nodes(cls, nodes, df, exogenous, root, top, stack)
176 root=tmp_root,
177 top=top,
--> 178 stack=stack) # pass tmp_root to the function as root
179 return top
180

... last 1 frames repeated, from the frame below ...

~\Anaconda3\lib\site-packages\hts\hierarchy\__init__.py in from_nodes(cls, nodes, df, exogenous, root, top, stack)
176 root=tmp_root,
177 top=top,
--> 178 stack=stack) # pass tmp_root to the function as root
179 return top
180

RecursionError: maximum recursion depth exceeded while calling a Python object
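
A RecursionError in `from_nodes` can have several causes, and the trace alone does not say which. One thing that may be worth ruling out (an assumption, not a confirmed diagnosis) is a hierarchy dict in which a node ends up as its own descendant, for example because a store and a state share the same label once the dicts are merged. The sketch below checks for that without recursion; `find_cycle` is a hypothetical helper, not part of scikit-hts.

```python
from typing import Dict, List, Optional


def find_cycle(hierarchy: Dict[str, List[str]],
               root: str = "total") -> Optional[List[str]]:
    """Return a path that revisits one of its own nodes, or None if no such path exists."""
    # Each stack entry is (node, path from the root down to that node).
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        for child in hierarchy.get(node, []):
            if child in path:
                return path + [child]  # child is already an ancestor: a cycle
            stack.append((child, path + [child]))
    return None


# Made-up hierarchies for illustration.
ok = {"total": ["CA", "TX"], "CA": ["CA_1", "CA_2"], "TX": ["TX_1"]}
bad = {"total": ["CA"], "CA": ["CA_1"], "CA_1": ["CA"]}  # CA_1 points back to CA
```

If `find_cycle(hierarchy)` returns a path for the three-level dict, the labels need to be made unique across levels before fitting.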

[DOCS] Documentation Example of how to apply top-down, middle-out, bottom-up approach.

Hi,

In the fpp2 book, in the chapter Forecasting hierarchical or grouped time series, it is possible to easily switch between different methods (top-down, middle-out, bottom-up) when using the hts approach.

prison.gts <- gts(prison/1e3, characters = c(3,1,9),
  gnames = c("State", "Gender", "Legal",
             "State*Gender", "State*Legal",
             "Gender*Legal"))
forecast(prison.gts, method="bu", fmethod="arima")

Can you provide an example of how this can be done using the scikit-hts package?
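
Judging by the other snippets on this page, scikit-hts selects the method through the `revision_method` argument of `HTSRegressor` (values such as 'OLS', 'BU', 'AHP', 'PHA' and 'FP' appear in the issues). What the top-down proportions mean can be sketched without the library. The code below is a plain-Python illustration of Average Historical Proportions and Proportions of Historical Averages for a single level; the data and function names are made up for the example.

```python
from typing import Dict, List


def average_historical_proportions(children: Dict[str, List[float]],
                                   total: List[float]) -> Dict[str, float]:
    """AHP: the mean over time of each child's share of the total."""
    n = len(total)
    return {name: sum(y / t for y, t in zip(series, total)) / n
            for name, series in children.items()}


def proportions_of_historical_averages(children: Dict[str, List[float]],
                                       total: List[float]) -> Dict[str, float]:
    """PHA: each child's historical average divided by the total's average."""
    mean_total = sum(total) / len(total)
    return {name: (sum(series) / len(series)) / mean_total
            for name, series in children.items()}


def top_down(total_forecast: float, proportions: Dict[str, float]) -> Dict[str, float]:
    """Split a forecast of the total across the children."""
    return {name: share * total_forecast for name, share in proportions.items()}


# Two periods of hypothetical history for two children and their total.
history = {"a": [10.0, 30.0], "b": [30.0, 30.0]}
total_history = [40.0, 60.0]
ahp = average_historical_proportions(history, total_history)
pha = proportions_of_historical_averages(history, total_history)
forecast = top_down(100.0, pha)
```

Both sets of proportions sum to one here, so the split forecasts add back up to the total; the two methods differ only in whether the averaging happens before or after the division.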

[BUG] Bug with prediction output since 0.5.4

Describe the bug

Predictions at lower levels have strong bias in versions 0.5.6, 0.5.5, and 0.5.4.
Only back to 0.5.3 do plots of predictions look more in line with intuition when compared to true observations.

Example used was this notebook from the examples.

  • Plot for auto-ARIMA model using 0.5.6 (please ignore ordering of subplots):
    example_notebook_0_5_6_plot
  • Plot using 0.5.3:
    example_notebook_0_5_3_plot
  • Plot from the notebook for reference:
    example_notebook_ref_plot

To Reproduce

Steps to reproduce the behavior:

  1. Replicate this example notebook with scikit-hts>=0.5.4.
  2. Reach the step near the end for plotting the predictions from the auto-ARIMA model (the issue also persists with Holt-Winters exponential smoothing).

Expected behavior

Predictions should not have strong positive or negative biases.

Desktop (please complete the following information):

  • OS: Win 10 Pro
  • scikit-hts version: >=0.5.4
  • Python version: 3.8.3

[ENHANCEMENT] Add warning step when dealing with string index instead of int / datetime

Is your proposed enhancement related to a problem? Please describe.
Sometimes you create a dataframe with the date in string format; then, when doing prediction, it will raise the following error:

image

The thing is that this happens only at prediction time, not at fitting time. Since, depending on your dataset, you can spend a lot of time fitting the model, it would be nice to inform the user or even try to change the index format a priori.

Describe the solution you'd like
Maybe try to infer freq at prediction time, or just emit a warning advising a pd.to_datetime transformation.

Describe alternatives you've considered
Trying to change index of nodes after fitting, but this is burdensome.
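
The warning step proposed above could look roughly like the following. This is only a sketch under stated assumptions: a plain list of labels stands in for the dataframe index, `check_index` is a hypothetical name, and ISO-formatted date strings are assumed. With pandas the conversion itself would be the `pd.to_datetime` call the issue mentions.

```python
import warnings
from datetime import datetime


def check_index(labels):
    """Warn at fit time if index labels are strings, and return parsed datetimes."""
    if any(isinstance(label, str) for label in labels):
        warnings.warn(
            "Index contains strings; convert it to datetimes before fitting.",
            UserWarning,
        )
        # Assumes ISO 8601 strings such as '2020-01-01'.
        return [datetime.fromisoformat(label) if isinstance(label, str) else label
                for label in labels]
    return list(labels)


# Capture the warning so the example shows both effects of the check.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    parsed = check_index(["2020-01-01", "2020-01-02"])
```

Running the check before the (potentially long) fit means the problem surfaces immediately rather than at prediction time.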

[DOCS] Documentation

Hi, I was trying out the package but found problems even with the basic examples. The documentation describes the following code:

from datetime import datetime
from hts import HTSRegressor
from hts.utilities.load_data import load_hierarchical_sine_data

# load some data
s, e = datetime(2019, 1, 15), datetime(2019, 10, 15)
df = load_hierarchical_sine_data(s, e).resample('1H').apply(sum)
df.head()

reg = HTSRegressor(model='prophet', revision_method='OLS')
reg = reg.fit(df=hsd, nodes=sine_hier)
preds = ht.predict(steps_ahead=10)

This code won't even run; the variable names passed to the fit call aren't even defined.

I think a correct version would be:

from datetime import datetime
from hts import HTSRegressor
from hts.utilities.load_data import load_hierarchical_sine_data
import hts

# load some data

if __name__ == "__main__":
    s, e = datetime(2019, 1, 15), datetime(2019, 10, 15)
    hsd = load_hierarchical_sine_data(start=s, end=e, n=10000)
    hier = {'total': ['a', 'b', 'c'],
                'a': ['a_x', 'a_y'],
                'b': ['b_x', 'b_y'],
                'c': ['c_x', 'c_y'],
                'a_x': ['a_x_1', 'a_x_2'],
                'a_y': ['a_y_1', 'a_y_2'],
                'b_x': ['b_x_1', 'b_x_2'],
                'b_y': ['b_y_1', 'b_y_2'],
                'c_x': ['c_x_1', 'c_x_2'],
                'c_y': ['c_y_1', 'c_y_2']
            }

    reg = HTSRegressor(model='prophet', revision_method='OLS')
    reg = reg.fit(df=hsd, nodes=hier)
    preds = reg.predict(steps_ahead=10)

Even so, this code fails when making predictions as shown here:

Traceback (most recent call last):
  File "d:/Users/jdcarvajal/Desktop/timeseers test/test_hts.py", line 25, in <module>
    preds = reg.predict(steps_ahead=10)
  File "D:\Users\jdcarvajal\Anaconda3\lib\site-packages\hts\core\regressor.py", line 311, in predict
    return self._revise(steps_ahead=steps_ahead)
  File "D:\Users\jdcarvajal\Anaconda3\lib\site-packages\hts\core\regressor.py", line 322, in _revise
    revised_index = self._get_predict_index(steps_ahead=steps_ahead)
  File "D:\Users\jdcarvajal\Anaconda3\lib\site-packages\hts\core\regressor.py", line 328, in _get_predict_index
    start = self.nodes.item.index[-1] + freq
TypeError: unsupported operand type(s) for +: 'Timestamp' and 'NoneType'

I think the basic example shown in the documentation should be able to run smoothly. I apologize if this error is caused by a problem with my installation, but from the looks of it, I think that is not the case.

Parallelize fits

Since each of the underlying models are constructed independently, offer the possibility to fit them in parallel.

Possible implementations:

  1. Barebones python multiprocessing
  2. joblib parallel
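
As a rough illustration of the idea, the sketch below fits one independent "model" per node in parallel using the standard library's `concurrent.futures`. It uses a thread pool for brevity, which is a swap from the two options listed above; `ProcessPoolExecutor` exposes the same `map` interface and is closer to the multiprocessing proposal for CPU-bound fits. The mean "model", the data, and the function names are stand-ins, not scikit-hts code.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Dict, List


def fit_mean_model(series: List[float]) -> float:
    """Stand-in for an expensive model fit: the 'model' is just the series mean."""
    return mean(series)


def fit_all(data: Dict[str, List[float]], n_jobs: int = 2) -> Dict[str, float]:
    """Fit one independent model per node, in parallel."""
    names = list(data)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() preserves input order, so the results line up with `names`.
        fitted = list(pool.map(fit_mean_model, (data[name] for name in names)))
    return dict(zip(names, fitted))


# Hypothetical series for three nodes.
models = fit_all({"total": [3.0, 5.0], "a": [1.0, 2.0], "b": [2.0, 3.0]})
```

Because no fit depends on another, the only coordination needed is collecting the results under the right node names.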

[BUG] BU revision seems to have some bugs

Describe the bug

The class method _y_hat_matrix appears to filter the forecasts down to the leaves, but it uses integer keys generated by range. When y_hat_matrix in functions.py is then called, it raises a key-not-found error, because the keys of forecasts may be strings taken from the df column names.

Additional context
To tackle it, the keys should be re-sorted so that the root key comes first (the order is important), after which a simple edit should suffice.
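
One way to read the suggestion above is to derive a root-first ordering of node names from the hierarchy and then look forecasts up by name rather than by integer position. The sketch below shows that idea only; it is not the fix that went into the library, and both function names are hypothetical.

```python
from collections import deque
from typing import Dict, List


def breadth_first_order(hierarchy: Dict[str, List[str]], root: str = "total") -> List[str]:
    """List node names level by level, starting with the root."""
    order: List[str] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(hierarchy.get(node, []))
    return order


def forecasts_in_order(forecasts: Dict[str, float],
                       hierarchy: Dict[str, List[str]],
                       root: str = "total") -> List[float]:
    """Return forecast values root-first, keyed by node name instead of position."""
    return [forecasts[name] for name in breadth_first_order(hierarchy, root)]


# Hypothetical hierarchy with string keys, as produced by dataframe column names.
hierarchy = {"total": ["a", "b"], "a": ["a_x", "a_y"]}
forecasts = {"a_x": 1.0, "a_y": 2.0, "b": 4.0, "a": 3.0, "total": 7.0}
ordered = forecasts_in_order(forecasts, hierarchy)
```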

[DOCS] Benchmarks

Document benchmarks of this library against:

  1. Barebones forecasting methodologies without reconciliation
  2. Hyndman's R implementation (https://cran.r-project.org/web/packages/hts/index.html)
  3. Methods against each other on a variety of datasets

Ideally, benchmarks would be easily reproducible and version controlled. In a top-level directory named benchmarks we implement the benchmarks described. In a documentation page, we describe the results and how new benchmarks can be added.

ValueError

ValueError: shapes (533,533) and (505,) not aligned: 533 (dim 1) != 505 (dim 0)

The shape of my modeling data frame is 42 rows × 505 columns.

clf = HTSRegressor(model='prophet', revision_method='OLS', n_jobs=12)
model = clf.fit(model_df, hierarchy)

The fit function works fine and training runs to 100%. But when I run

preds = model.predict(steps_ahead=3)

I get the ValueError after the 100% training is done. I printed each step, and I get output up to new_mat and hat_mat:

def project(hat_mat: np.ndarray, sum_mat: np.ndarray, optimal_mat: np.ndarray) -> np.ndarray:
    new_mat = np.empty([hat_mat.shape[0], sum_mat.shape[0]])
    print(f"hat_mat: {hat_mat}")
    for i in range(hat_mat.shape[0]):
        new_mat[i, :] = np.dot(optimal_mat, np.transpose(hat_mat[i, :]))
    return new_mat
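
The shapes in the error message differ by 28 (533 versus 505), and the data frame has 505 columns. One possible explanation, offered as an assumption rather than a diagnosis, is that the hierarchy names nodes for which the data frame has no column. A quick check along these lines can confirm or rule that out; the helper names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def hierarchy_nodes(hierarchy: Dict[str, List[str]]) -> Set[str]:
    """Collect every node name that appears as a parent or as a child."""
    nodes = set(hierarchy)
    for children in hierarchy.values():
        nodes.update(children)
    return nodes


def compare_nodes_to_columns(hierarchy: Dict[str, List[str]],
                             columns: Iterable[str]) -> Dict[str, List[str]]:
    """Report nodes without a column and columns that are not in the hierarchy."""
    nodes = hierarchy_nodes(hierarchy)
    cols = set(columns)
    return {
        "missing_columns": sorted(nodes - cols),
        "extra_columns": sorted(cols - nodes),
    }


# Tiny made-up example: node 'b' has no column, column 'c' has no node.
report = compare_nodes_to_columns(
    {"total": ["a", "b"], "a": ["a_x"]},
    ["total", "a", "a_x", "c"],
)
```

With the real data, the call would be `compare_nodes_to_columns(hierarchy, model_df.columns)`; both lists should come back empty before fitting.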

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

  • scikit-hts version: 0.5.3
  • Python version: python 3.7
  • Operating System: MacOS

Description

Trying to recreate example available on scikit-hts-examples/notebooks/ using below data

from datetime import datetime
from hts import HTSRegressor
from hts.utilities.load_data import load_hierarchical_sine_data

sine_hier = {'total': ['a', 'b', 'c'], 'a': ['aa', 'ab'], 'aa': ['aaa', 'aab'], 'b': ['ba', 'bb'], 'c': ['ca', 'cb', 'cc', 'cd']}
s, e = datetime(2019, 1, 15), datetime(2019, 10, 15)
df = load_hierarchical_sine_data(s, e).resample('1H').apply(sum)

What I Did

Works
reg = HTSRegressor(model='sarimax', revision_method='OLS')
reg = reg.fit(df=df, nodes=sine_hier)

Does not work
preds = reg.predict(steps_ahead=10)

I also tried different models but none works.

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/user/anaconda3/lib/python3.7/site-packages/hts/utilities/distribution.py", line 40, in _function_with_partly_reduce
    return list(results)
  File "/home/user/anaconda3/lib/python3.7/site-packages/hts/utilities/distribution.py", line 39, in <genexpr>
    results = (map_function(chunk, kwargs) for chunk in chunk_list)
  File "/home/user/anaconda3/lib/python3.7/site-packages/hts/core/utils.py", line 99, in _do_actual_predict
    **function_kwargs['predict_kwargs'])
  File "/home/user/anaconda3/lib/python3.7/site-packages/hts/model/ar.py", line 108, in predict
    return self._set_results_return_self(in_sample_preds, y_hat)
  File "/home/user/anaconda3/lib/python3.7/site-packages/hts/model/base.py", line 69, in _set_results_return_self
    self.residual = (in_sample - self.node.get_series()).values
  File "/home/user/anaconda3/lib/python3.7/site-packages/pandas/core/ops/common.py", line 64, in new_method
  File "/home/user/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 503, in wrapper
    right = to_series(right)
  File "/home/user/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py", line 197, in arithmetic_op
  File "/home/user/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py", line 149, in na_arithmetic_op
    result = masked_arith_op(left, right, op)
  File "/home/user/anaconda3/lib/python3.7/site-packages/pandas/core/computation/expressions.py", line 228, in evaluate
    use_numexpr = use_numexpr and _bool_arith_check(op_str, a, b)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

How to make forecasts strictly positive?

  • scikit-hts version: 0.5.5
  • Python version: 3.6
  • Operating System: Windows

Description

I want my predictions to be strictly positive. For that, I want to use a log transformation, and I've passed a custom transformation function to the transform parameter. The final results still have negative forecasts. I want to know whether I'm passing the custom function correctly and, if so, why the results are negative.

I've created the custom function in 2 ways.
Here goes the first one.

from collections import namedtuple
transform = namedtuple('transform', {'func': 'Callable', 'inv_func': 'Callable'})
transformer = transform(np.log1p, np.exp)
htsmodel = hts.HTSRegressor(model = 'auto_arima', revision_method = 'BU', n_jobs = 0, transform = transformer)

And the second one

transformer = hts._t.Transform(np.log1p, np.exp)
htsmodel = hts.HTSRegressor(model = 'auto_arima', revision_method = 'BU', n_jobs = 0, transform = transformer)

Unfortunately, neither of them has any effect in producing positive forecasts. So, am I passing them the wrong way, or is it something else?

PS: When I transform the data explicitly before fitting and back-transform after prediction, the forecasts are positive.
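
A side note on the transform pair used in both attempts above: the inverse of `log1p` is `expm1`, not `exp`. Pairing `log1p` with `exp` returns x + 1, which stays positive, so it does not explain the negative forecasts, but it would bias the results once the transform is actually applied. The standard-library sketch below shows the difference; the function names are illustrative only.

```python
import math


def forward(x: float) -> float:
    """log(1 + x), defined for x > -1."""
    return math.log1p(x)


def inverse(z: float) -> float:
    """exp(z) - 1, the exact inverse of log1p."""
    return math.expm1(z)


def mismatched_inverse(z: float) -> float:
    """exp(z): what the snippets above pair with log1p."""
    return math.exp(z)


x = 9.0
round_trip = inverse(forward(x))             # recovers x
off_by_one = mismatched_inverse(forward(x))  # recovers x + 1
```

Note also that `expm1` maps negative values in log space to results between -1 and 0, so a `log1p`/`expm1` pair keeps forecasts above -1 rather than strictly above zero. A plain `log`/`exp` pair guarantees strictly positive back-transformed values, at the cost of requiring strictly positive input data.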

Does HTS handle linear regression in a forecast?

Carlo,

I have historical data which has both seasonal and linear components (e.g. sales that are both increasing month on month, and also have seasonal variations). Does the HTSRegressor handle it? I am getting all negative forecasts when I tried your function. I have a hierarchy of 14 countries and 12 products under each country. I used SARIMAX with revision method = 'OLS'. Should I be using some other revision method?

Vasuki

Carlo, I was able to get rid of the negative values. The only problem is that the forecast is flat even though the historical data has obvious linear and seasonal trends. Do we need to pass some values on to the underlying model (e.g. auto_arima, holt_winters)? If so, how do you pass those values through HTSRegressor?

Vasuki

[BUG] Initialization HTSRegressor throws error requesting pmdarima

Describe the bug
On a fresh install, trying to import HTSRegressor throws an error:
ModuleNotFoundError: No module named 'pmdarima'

To Reproduce

pip install -U scikit-hts, pmdarima

Within python

from hts import HTSRegressor

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-14-98f3b7bae235> in <module>
----> 1 from hts import HTSRegressor

~/venv/python3.8/lib/python3.8/site-packages/hts/__init__.py in <module>
----> 1 from hts.core.regressor import HTSRegressor
      2 from hts.revision import RevisionMethod
      3 
      4 __author__ = """Carlo Mazzaferro"""
      5 __email__ = '[email protected]'

~/venv/python3.8/lib/python3.8/site-packages/hts/core/regressor.py in <module>
      6 from sklearn.base import BaseEstimator, RegressorMixin
      7 
----> 8 from hts import model as hts_models, defaults
      9 from hts._t import Transform, NodesT, ExogT, Model
     10 from hts.core.exceptions import MissingRegressorException, InvalidArgumentException

~/venv/python3.8/lib/python3.8/site-packages/hts/model/__init__.py in <module>
      1 from hts._t import Model
      2 
----> 3 from hts.model.ar import AutoArimaModel, SarimaxModel
      4 from hts.model.es import HoltWintersModel
      5 from hts.model.p import FBProphetModel

~/venv/python3.8/lib/python3.8/site-packages/hts/model/ar.py in <module>
      6 from hts.hierarchy import HierarchyTree
      7 from hts._t import Model
----> 8 from hts.model.base import TimeSeriesModel
      9 
     10 logger = logging.getLogger(__name__)

~/venv/python3.8/lib/python3.8/site-packages/hts/model/base.py in <module>
      3 import numpy
      4 import pandas
----> 5 from pmdarima import AutoARIMA
      6 from scipy.special._ufuncs import inv_boxcox
      7 from scipy.stats import boxcox

ModuleNotFoundError: No module named 'pmdarima'

Expected behavior
To import HTSRegressor without module import error

Desktop (please complete the following information):

  • OS: "Ubuntu 18.04.4 LTS"
  • scikit-hts version: 0.5.1
  • Python version: 3.8.1

Additional context

[ENHANCEMENT] Make fitting with exogenous variables more reliable

Is your proposed enhancement related to a problem? Please describe.
Fitting with exogenous variables has no testing of any kind, and is likely broken for some of the models

Describe the solution you'd like
Test cases for fitting with exogenous variables for each of the models

Describe alternatives you've considered
N/A.

Additional context
N/A

[BUG] Predict function fails when using the revision methods 'AHP' and 'PHA' while also setting transform=True

Describe the bug
The predict function of an HTSRegressor throws the error "'bool' object has no attribute 'inverse_transform'" when it is called with the revision method 'AHP' or 'PHA' and transform is set to True.

When the revision method is instantiated inside the _init_revision function, self.transform, which is a boolean value, is passed to the RevisionMethod class but it is expecting a transformer function to be passed. This causes the revision methods 'AHP' and 'PHA' to throw an error when the revise method from revision.py is called, as they try to use the inverse transform function.

To Reproduce
Create a HTSRegressor, setting the revision method to either 'AHP' or 'PHA' and set transform to True. Fit the model to data and then call the predict function.

Expected behavior
The predict function runs without error and outputs the correct predictions

Desktop (please complete the following information):
N/A

Additional context
This was run on the free version of colab
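
The report above suggests resolving the boolean `transform` flag into an actual transformer before it reaches the revision step. A minimal sketch of that idea follows. Everything in it is an assumption made for illustration: the `Transform` container borrows the `func`/`inv_func` field names from the namedtuple quoted in another issue on this page, the default pair is arbitrary, and `resolve_transform` is a hypothetical helper rather than library code.

```python
import math
from collections import namedtuple

# Hypothetical container; the field names follow the namedtuple quoted elsewhere on this page.
Transform = namedtuple("Transform", ["func", "inv_func"])

# Assumed default, chosen only so the example runs; not the library's actual default.
DEFAULT_TRANSFORM = Transform(func=math.log1p, inv_func=math.expm1)


def resolve_transform(transform):
    """Turn the user-facing `transform` argument into a Transform or None."""
    if transform is None or transform is False:
        return None
    if transform is True:
        return DEFAULT_TRANSFORM
    if callable(getattr(transform, "func", None)) and callable(getattr(transform, "inv_func", None)):
        return transform
    raise TypeError("transform must be a bool or provide callable func and inv_func")


resolved = resolve_transform(True)
restored = resolved.inv_func(resolved.func(3.0))
```

With a step like this in place, code downstream never sees a bare `True` and can call the inverse function unconditionally whenever the resolved value is not None.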

Decrease memory usage

A single Prophet model fit on a fairly small dataset generates a model of around 2GB: https://gist.github.com/carlomazzaferro/41c13d73067c08f9a7c2f5b148eccbe0

When performing fits on hundreds of models this becomes a bottleneck, as each of the models is kept in memory because they need to be available during the prediction step.

Proposed solutions:

  1. Limit the behavior of FBProphetModel to implement only fit_predict, removing the need to keep models in memory
  2. For each time series, fit and serialize the model, keeping in memory only a reference to the model's serialized path. On predict, load each model one by one
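
The second proposal can be sketched with the standard library alone. In the example below a tiny dict stands in for a fitted model, and the function names are hypothetical; the point is only the pattern of keeping paths in memory and loading one model at a time at prediction.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, List


def fit_and_store(name: str, series: List[float], directory: str) -> Path:
    """Fit a stand-in model, write it to disk, and return only its path."""
    model = {"node": name, "mean": sum(series) / len(series)}  # stand-in for a fitted model
    path = Path(directory) / f"{name}.pkl"
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return path


def predict_from_disk(paths: Dict[str, Path], steps_ahead: int) -> Dict[str, List[float]]:
    """Load each model in turn, so only one is held in memory at a time."""
    forecasts: Dict[str, List[float]] = {}
    for name, path in paths.items():
        with path.open("rb") as handle:
            model = pickle.load(handle)
        forecasts[name] = [model["mean"]] * steps_ahead
    return forecasts


# Hypothetical series for two nodes; the temporary directory is removed afterwards.
data = {"total": [3.0, 5.0], "a": [1.0, 3.0]}
with tempfile.TemporaryDirectory() as tmp:
    paths = {name: fit_and_store(name, series, tmp) for name, series in data.items()}
    forecasts = predict_from_disk(paths, steps_ahead=2)
```

The trade-off is extra disk I/O at prediction time in exchange for peak memory that scales with a single model instead of with the number of nodes.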

[BUG] Prediction broken with KeyError: "Invalid unit abbreviation: QS-OCT"

Describe the bug
After fitting a model using HTSRegressor, calling the predict method raises an error coming from pandas' timedelta functionality. Apparently pandas.Timedelta does not accept quarterly frequency indexes.

image

To Reproduce
I used the visnights data from Hyndman's book. I can upload the notebook to scikit-examples so we can investigate this further.

Expected behavior
My last date was:

self.nodes.item.index.max()
Out: Timestamp('2016-10-01 00:00:00', freq='QS-JAN')

And expected output was:

DatetimeIndex(['2017-01-01', '2017-04-01', '2017-07-01', '2017-10-01',
               '2018-01-01'],
              dtype='datetime64[ns]', freq='QS-JAN')

for steps_ahead = 4.

Desktop (please complete the following information):

  • OS: WSL
  • scikit-hts version: 0.5.0
  • Python version: 3.7.4

Additional context

I did some searching, and the problem is that pandas.Timedelta does not accept quarterly frequencies, nor monthly or yearly ones, which could be a problem.
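
The underlying reason is that a quarter, a month, or a year is not a fixed number of days, so it cannot be expressed as a fixed-length time delta; the future index has to be built with calendar-aware arithmetic instead (pandas provides this through objects such as `pandas.DateOffset`). The standard-library sketch below shows the calendar arithmetic for the quarterly case from this report. The function names are hypothetical and it assumes first-of-month dates.

```python
from datetime import date
from typing import List


def add_months(d: date, months: int) -> date:
    """Shift a first-of-month date by a whole number of calendar months."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, d.day)


def quarterly_index(last: date, steps_ahead: int) -> List[date]:
    """Quarter-start dates following `last`, one per forecast step."""
    return [add_months(last, 3 * step) for step in range(1, steps_ahead + 1)]


# Last observed date from the report above.
future = quarterly_index(date(2016, 10, 1), steps_ahead=4)
```

For the last date in the report, this yields the four quarter starts of 2017.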

[BUG] Bug report - exog variables fitting

Hi,
(Seems that I opened the issue with the wrong repo, at scikit-hts-examples, and got no response since Jan 4. But so glad I found this one)

I am following the hts documentation to add exogenous variables to a model.

So far I succeeded in using the hmv load_mobility_data(), creating exogenous features for each node:
exogenous = {k: ['precipitation', 'temp'] for k in hmv.columns if k not in ['precipitation', 'temp']}
and pass it in

clf = HTSRegressor(model='prophet', revision_method='OLS',  n_jobs=10)
model = clf.fit(hmv, hier, exogenous=exogenous)

The model ran fine, but I can't figure out how to pass in the exogenous_df in the predict() function.
The documentation says:

Parameters
• exogenous_df (pandas.DataFrame) – A dataframe of length == steps_ahead containing the exogenous data for each of the nodes

For the hmv data, the exogenous features were ["temperature", "precipitation"]
so I passed in the data frame of 7 rows for predicting the next 7 days as exogenous_df:

ds	     precipitation     temp
2016-09-01	0.00000	    77.00000
2016-09-02	0.00000	    74.00000
2016-09-03	0.00000	    66.00000
2016-09-04	0.00000	    68.00000
2016-09-05	0.00000	    68.00000
2016-09-06	0.00000	    64.00000
2016-09-07	0.00000	    65.00000

and ran preds = model.predict(steps_ahead=7, exogenous_df=exogenous_df)

But I'm not sure how to pass in the node information in the data frame.
The hierarchy of the hmv data is:

{'total': ['CH', 'SLU', 'BT', 'OTHER'],
 'CH': ['CH-07', 'CH-02', 'CH-08', 'CH-05', 'CH-01'],
 'SLU': ['SLU-15', 'SLU-01', 'SLU-19', 'SLU-07', 'SLU-02'],
 'BT': ['BT-01', 'BT-03'],
 'OTHER': ['WF-01', 'CBD-13']}

The error I got (which is probably related to the missing node info in the above exogenous_df) is:

~/analytics-etl/virtualenv/lib/python3.8/site-packages/hts/core/regressor.py in predict(self, exogenous_df, steps_ahead, distributor, disable_progressbar, show_warnings, **predict_kwargs)
    270         """
    271 
--> 272         steps_ahead = self.__init_predict_step(exogenous_df, steps_ahead)
    273         predict_function_kwargs = {'fit_kwargs': predict_kwargs,
    274                                    'steps_ahead': steps_ahead,

~/analytics-etl/virtualenv/lib/python3.8/site-packages/hts/core/regressor.py in __init_predict_step(self, exogenous_df, steps_ahead)
    222 
    223     def __init_predict_step(self, exogenous_df: pandas.DataFrame, steps_ahead: int):
--> 224         if self.exogenous and not exogenous_df:
    225             raise MissingRegressorException(f'Exogenous variables were provided at fit step, hence are required at '
    226                                             f'predict step. Please pass the \'exogenous_df\' variable to predict '

~/analytics-etl/virtualenv/lib/python3.8/site-packages/pandas/core/generic.py in __nonzero__(self)
   1327 
   1328     def __nonzero__(self):
-> 1329         raise ValueError(
   1330             f"The truth value of a {type(self).__name__} is ambiguous. "
   1331             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Can someone provide an example of exogenous_df for the predict() function?

Thank you!
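
The traceback above stops at `if self.exogenous and not exogenous_df`, which asks a DataFrame for its truth value; pandas refuses that regardless of what the frame contains. So the error does not depend on how the node information is laid out in `exogenous_df`. A likely fix (an assumption, since the library's eventual change is not shown here) is to test for `None` explicitly. The sketch below reproduces the behaviour with a small stand-in class so that it runs without pandas.

```python
class FakeFrame:
    """Stand-in that mimics how a pandas DataFrame refuses to act as a boolean."""

    def __bool__(self):
        raise ValueError("The truth value of a FakeFrame is ambiguous.")


def needs_exogenous_buggy(exogenous, exogenous_df) -> bool:
    # Mirrors the condition in the traceback: `not exogenous_df` calls __bool__.
    return bool(exogenous and not exogenous_df)


def needs_exogenous_fixed(exogenous, exogenous_df) -> bool:
    # Only asks whether a frame was passed at all.
    return bool(exogenous) and exogenous_df is None


exogenous = {"total": ["precipitation", "temp"]}
missing = needs_exogenous_fixed(exogenous, None)          # frame required but absent
provided = needs_exogenous_fixed(exogenous, FakeFrame())  # frame present: no error
```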

[BUG] WLSS and WLSV are treated the same as OLS by revise

Describe the bug
Revise method in revision.py treats WLSS and WLSV as OLS

To Reproduce
Specify the reconciliation method as WLSS or WLSV when using the hts regressor.

Expected behavior
When self.name is OLS, WLSS, or WLSV, I would expect the revise method to pass method = self.name to optimal_combination() but instead, it passes method=MethodsT.OLS.name
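
The expected behavior described above amounts to passing the selected method name through instead of hard-coding OLS. A minimal sketch of that dispatch follows; the enum, `optimal_combination`, and `revise` here are simplified stand-ins written for illustration, not the library's implementations.

```python
from enum import Enum


class Method(Enum):
    """Hypothetical mirror of the three optimal-combination method names."""
    OLS = "OLS"
    WLSS = "WLSS"
    WLSV = "WLSV"


def optimal_combination(method: str) -> str:
    """Stand-in that only reports which weighting it was asked to use."""
    return f"optimal combination using {method} weights"


def revise(name: str) -> str:
    """Forward the chosen method instead of always forwarding OLS."""
    if name in {member.name for member in Method}:
        return optimal_combination(method=name)
    raise ValueError(f"unknown revision method: {name}")


result = revise("WLSV")
```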

Desktop (please complete the following information):

  • OS: Windows 10
  • scikit-hts version: [e.g. 0.2.1]
  • Python version: 3.7.1

Additional context

[BUG] Python segmentation fault

Describe the bug

I am trying to use the package for hierarchical time series using prophet. My code worked properly up to and including a 4 level hierarchy. But when I extend hierarchy to 5 and 6 levels I get a python segmentation fault.

Fatal Python error: Segmentation fault

Current thread 0x00000001170e1e00 (most recent call first):
File "/Users/ndah/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexers.py", line 258 in maybe_convert_indices
File "/Users/ndah/opt/anaconda3/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1467 in take
File "/Users/ndah/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py", line 3585 in take
File "/Users/ndah/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py", line 3599 in _take_with_is_copy
File "/Users/ndah/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 3036 in __getitem__
File "/Users/ndah/opt/anaconda3/lib/python3.8/site-packages/hts/hierarchy/__init__.py", line 163 in from_nodes
File "/Users/ndah/opt/anaconda3/lib/python3.8/site-packages/hts/hierarchy/__init__.py", line 173 in from_nodes
File "/Users/ndah/opt/anaconda3/lib/python3.8/site-packages/hts/hierarchy/__init__.py", line 173 in from_nodes

To Reproduce

input
image

code:
hts = HTSRegressor(
    model='prophet',
    revision_method='OLS',
    low_memory=True,
    daily_seasonality=False,
    weekly_seasonality=True,
    yearly_seasonality=True,
    n_jobs=4
)

hts_train = hts.fit(df=train_ts, nodes=hie.hierarchy, root='total')
hts_forcast = hts_train.predict(steps_ahead=days)

Expected behaviour
did not expect any error

Desktop (please complete the following information):

  • OS: MacOS Bigsur/Linux Mint
  • scikit-hts version: 0.5.3
  • Python version: 3.8.5

[ENHANCEMENT] add_seasonality in prophet model

Is your proposed enhancement related to a problem? Please describe.
In the non-hts version of prophet, we use external variables in both the external_regressor way and in the 'add_seasonality' way, since the performance of using one or the other differs per variable. Currently (as far as I can tell), this option is not yet available here.

Describe the solution you'd like
Ideally we would be able to make the distinction between adding exogenous variables as regressors and as seasonality.

Describe alternatives you've considered
One option would be to add an extra parameter to the HierarchyTree besides exogenous (e.g. named seasonality).

Another would be to specify the distinction in the exogenous dictionary.
For example, instead of filling exogenous with:
'total': ['general_variable1', 'general_variable2']
you would also be able to fill it with:
'total': [{'name': 'general_variable1', 'type': 'external_regressor'}, {'name': 'general_variable2', 'type': 'seasonality'}]
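As a rough sketch of the second alternative, the snippet below splits a mixed exogenous specification into regressor and seasonality columns per node. The function name `split_exogenous`, the `VALID_TYPES` set, and the rule that bare strings default to `external_regressor` are all assumptions made for illustration; none of this is part of scikit-hts.

```python
from typing import Dict, List, Union

# Hypothetical spec entry: a bare column name, or a dict with 'name' and 'type'.
ExogEntry = Union[str, Dict[str, str]]

VALID_TYPES = {"external_regressor", "seasonality"}  # assumed type names


def split_exogenous(spec: Dict[str, List[ExogEntry]]) -> Dict[str, Dict[str, List[str]]]:
    """Group each node's exogenous columns by how they should be added."""
    result = {}
    for node, entries in spec.items():
        grouped = {kind: [] for kind in sorted(VALID_TYPES)}
        for entry in entries:
            if isinstance(entry, str):
                # Assumption: a bare string is treated as a plain regressor.
                name, kind = entry, "external_regressor"
            else:
                name = entry["name"]
                kind = entry.get("type", "external_regressor")
            if kind not in VALID_TYPES:
                raise ValueError(f"unknown exogenous type {kind!r} for node {node!r}")
            grouped[kind].append(name)
        result[node] = grouped
    return result


spec = {
    "total": [
        "general_variable1",
        {"name": "general_variable2", "type": "seasonality"},
    ]
}
print(split_exogenous(spec))
```

Accepting both forms would keep existing list-of-strings dictionaries working while allowing the per-variable distinction.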

Additional context
Once we agree on what would be the best solution here, we might also be able to pick up (part of) the implementation depending on the complexity.

revise_forecasts() method raises an attribute error. AttributeError: 'Series' object has no attribute 'yhat'

I want to revise my base forecasts, so I tried the revise_forecasts() method, but it raised the following error.

  • scikit-hts version: scikit-hts==0.5.4
  • Python version: 3.6
  • Operating System: Windows
AttributeError                            Traceback (most recent call last)
<ipython-input-191-307987f74b26> in <module>
----> 1 revise_forecasts(method = 'FP', forecasts = pred, summing_matrix = sum_mat, nodes = hierarchyNodes)

/home/Gopal/anaconda/envs/time_series/lib/python3.6/site-packages/hts/convenience.py in revise_forecasts(method, forecasts, errors, residuals, summing_matrix, nodes, transformer)
     63     )
     64 
---> 65     revised = revision.revise(forecasts=forecasts, mse=errors, nodes=nodes)
     66 
     67     return pandas.DataFrame(revised, columns=list(forecasts.keys()))

/home/Gopal/anaconda/envs/time_series/lib/python3.6/site-packages/hts/revision.py in revise(self, forecasts, mse, nodes)
     74 
     75         elif self.name == MethodT.FP.name:
---> 76             return forecast_proportions(forecasts, nodes)
     77 
     78         else:

/home/Gopal/anaconda/envs/time_series/lib/python3.6/site-packages/hts/functions.py in forecast_proportions(forecasts, nodes)
    196 
    197     key = choice(list(forecasts.keys()))
--> 198     new_mat = np.empty([len(forecasts[key].yhat), n_cols - 1])
    199     new_mat[:, 0] = forecasts[key].yhat
    200 

/home/Gopal/anaconda/envs/time_series/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5137             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5138                 return self[name]
-> 5139             return object.__getattribute__(self, name)
   5140 
   5141     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'yhat'

What could be the best fix for this problem?

[ENHANCEMENT] Ability to use different metrics for calculating forecasting errors

Is your proposed enhancement related to a problem? Please describe.
Right now the metric used to calculate the forecasting error is hard coded as the MSE: https://github.com/carlomazzaferro/scikit-hts/blob/master/hts/model/base.py#L69

Providing the ability to use custom metrics would greatly improve the usability of the library, as well as tailoring it to different use-cases

Describe the solution you'd like
Being able to pass a string for a built-in metric, or a callable. In case of a callable, the user must ensure the metric is implemented correctly. Examples of possible custom metrics in documentation should be added as well.

Describe alternatives you've considered
None, this is the first formalisation of the issue

Additional context
Metrics that should be built in:

Ideally the implementation follows the same interface as sklearn metric's: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
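A minimal sketch of the "string or callable" idea follows. The registry `BUILTIN_METRICS`, the helper `resolve_metric`, and the two built-in names are hypothetical; the issue does not list which metrics should ship. The callables take `(y_true, y_pred)` in the same order as the sklearn metric linked above.

```python
from typing import Callable, Sequence, Union

Metric = Callable[[Sequence[float], Sequence[float]], float]


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical registry of built-in names; the actual list is left open in the issue.
BUILTIN_METRICS = {"mse": mean_squared_error, "mae": mean_absolute_error}


def resolve_metric(metric: Union[str, Metric]) -> Metric:
    """Return a callable for a built-in name, or pass a user callable through."""
    if callable(metric):
        # The user is responsible for the correctness of a custom metric.
        return metric
    try:
        return BUILTIN_METRICS[metric]
    except KeyError:
        known = ", ".join(sorted(BUILTIN_METRICS))
        raise ValueError(f"unknown metric {metric!r}; expected one of: {known}") from None


error = resolve_metric("mse")([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(error)
```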

[BUG] Bug report

Describe the bug
Cannot import name 'HierarchyTree' from 'hts'.

To Reproduce
from hts import HierarchyTree

Expected behavior
Expected HierarchyTree to be imported in order to define a hierarchy to pass into HTSRegressor.

Desktop (please complete the following information):

  • OS: Windows 10
  • scikit-hts version: 0.5.2
  • Python version: 3.7.3

Additional context

ImportError: cannot import name 'HierarchyTree' from 'hts' (C:\Users\Private\AppData\Local\Continuum\anaconda3\lib\site-packages\hts\__init__.py)

[FEATURE] Automatic Generation of Hierarchy Dataset and Mapping

Is your feature request related to a problem? Please describe.
It is very complex to create the hierarchical pivot dataset from a dataset in long form, especially if there are many levels in the hierarchy.

Describe the solution you'd like
A method which takes in the date field and a list of levels (column names) in the dataset and generates the pivot table at the date level and the hierarchy mapping dictionary after learning the relationships from the dataset itself

Describe alternatives you've considered
Manually creating the dataset and dictionary which can be tedious and error prone
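The requested helper could look roughly like the sketch below, written with the standard library only so it stays self-contained. The function name `build_hierarchy`, its parameters, and the convention of joining level values with `_` to name nodes are assumptions for illustration, not an existing scikit-hts function.

```python
from collections import defaultdict


def build_hierarchy(rows, date_field, levels, value_field, root="total", sep="_"):
    """Derive (hierarchy, wide) from long-form records.

    hierarchy maps each parent node to a sorted list of its children.
    wide maps each date to {node: summed value}, one entry per node.
    """
    children = defaultdict(set)
    wide = defaultdict(lambda: defaultdict(float))
    for row in rows:
        date, value = row[date_field], row[value_field]
        wide[date][root] += value
        parent, path = root, []
        for level in levels:
            # Node names are the joined path, so equal labels under
            # different parents do not collide.
            path.append(str(row[level]))
            node = sep.join(path)
            children[parent].add(node)
            wide[date][node] += value
            parent = node
    hierarchy = {parent: sorted(kids) for parent, kids in children.items()}
    return hierarchy, {date: dict(cols) for date, cols in wide.items()}


rows = [
    {"date": "2020-01-01", "state": "CA", "city": "LA", "sales": 3.0},
    {"date": "2020-01-01", "state": "CA", "city": "SF", "sales": 2.0},
    {"date": "2020-01-01", "state": "TX", "city": "Austin", "sales": 4.0},
]
hierarchy, wide = build_hierarchy(rows, "date", ["state", "city"], "sales")
print(hierarchy)
print(wide["2020-01-01"])
```

The `wide` mapping has one value per node and date, which is the column-per-node layout the package expects; turning it into a dataframe is left out here.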

Weird Result from HTS using Prophet

Describe the bug
Defining HTSRegressor with model "prophet"

clf = HTSRegressor(model='prophet', revision_method='OLS')
model1 = clf.fit(df_train, hierarchy)

returns something that looks like a linear regression.
When I plot the predictions from scikit-hts next to those from fbprophet, fitted as

model = fbp.Prophet(yearly_seasonality=True)
model.fit(df, algorithm='LBFGS')
future = model.make_future_dataframe(periods=days_predict, freq='D')
forecast = model.predict(future)

the results obtained are really different.

From clf1 we get

delete

Expected behavior
I expected to get a curve that captures trends well, along with seasonalities and the other features that prophet is capable of modelling.

Desktop (please complete the following information):

  • scikit-hts version: [e.g. 0.5.3]
  • Python version: [e.g. 3.8.3]

[BUG] UserWarning coming from predict in auto_arima

Describe the bug
When fitting models using auto_arima, a UserWarning is logged to the screen for each node at prediction time.

image

To Reproduce
Fit an HTSRegressor using auto_arima as the model option.

Expected behavior
No warning.

Desktop (please complete the following information):

  • OS: WSL
  • scikit-hts version: 0.5.1
  • Python version:3.7.4

Additional context
I am not sure whether this fits the bug template; please tell me if another tag is more appropriate.

[FEATURE] Decouple reconciliation methods from HTSRegressor's implementation

Is your feature request related to a problem? Please describe.
I'd like to be able to use reconciliation methods with my own supplied forecasts, instead of relying on the ones provided by hts. This would allow:

  1. Custom model architectures and parameters for each of the nodes
  2. Separation of concerns, which makes things more modular and testable
  3. Easier to benchmark quality of results

Describe the solution you'd like
Possibly, implementing a scikit-like interface for the revision methods, as they are just matrix transformations
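To illustrate the "just matrix transformations" point, here is a small standalone sketch of bottom-up reconciliation that takes user-supplied bottom-level forecasts and a hierarchy dictionary. The helper names (`leaves_under`, `summing_matrix`, `bottom_up`) are hypothetical and independent of the hts package; only the bottom-up method is shown.

```python
def leaves_under(node, hierarchy):
    """Return the bottom-level nodes below `node` (or `node` itself if it is a leaf)."""
    kids = hierarchy.get(node)
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(leaves_under(kid, hierarchy))
    return found


def summing_matrix(hierarchy, root):
    """Build S with one row per node and one column per bottom-level node."""
    order = [root]
    for node in order:  # the list grows while we walk it: breadth-first order
        order.extend(hierarchy.get(node, []))
    bottom = leaves_under(root, hierarchy)
    rows = []
    for node in order:
        covered = set(leaves_under(node, hierarchy))
        rows.append([1 if leaf in covered else 0 for leaf in bottom])
    return order, bottom, rows


def bottom_up(bottom_forecasts, hierarchy, root):
    """Reconcile by summing bottom-level forecasts: y = S @ b."""
    order, bottom, s_mat = summing_matrix(hierarchy, root)
    b = [bottom_forecasts[leaf] for leaf in bottom]
    return {node: sum(s * v for s, v in zip(row, b)) for node, row in zip(order, s_mat)}


hier = {"total": ["a", "b"], "a": ["aa", "ab"]}
revised = bottom_up({"aa": 1.0, "ab": 2.0, "b": 4.0}, hier, "total")
print(revised)
```

Because the function only needs forecasts and a hierarchy, the forecasts can come from any model, which is the separation of concerns the request describes.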

Describe alternatives you've considered
Keep it as is

[FEATURE] Creating Grouped Time Series Structure

Is your feature request related to a problem? Please describe.
It's not a problem. I am looking to create a grouped time series structure, as opposed to a hierarchical.

Describe the solution you'd like
I was wondering if you have any experience, suggestions, or an example would be wonderful demonstrating how to use this library to create a grouped time series structure? Or if it is possible in the current version?

Describe alternatives you've considered
I have looked through the documentation and have not been able to find references specific to creating grouped time series. Two attempted solutions are:

  1. Create a group structured dictionary to pass to hts.hierarchy.HierarchyTree.from_nodes(). Although, I have not been able to figure out how to structure the dictionary to create the correct summing matrix.

  2. Use hts.hierarchy.utils.groupify(), although similarly I was unsure how to utilize this function correctly to create a tree with a grouping structure.
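For reference, the summing matrix that a grouped structure implies can be written down directly, as in the sketch below. It assumes every bottom-level series is identified by a tuple of attribute values (here region and product, both made-up examples) and does not use any hts API; whether from_nodes() or groupify() can produce the same matrix is not established here.

```python
def grouped_summing_matrix(bottom_keys):
    """Build (labels, rows) for a grouped structure.

    bottom_keys: list of tuples, one attribute value per grouping dimension.
    Rows are: the total, one aggregate per attribute value, then each bottom series.
    """
    n_bottom = len(bottom_keys)
    n_dims = len(bottom_keys[0])
    labels = ["total"]
    rows = [[1] * n_bottom]
    for dim in range(n_dims):
        for value in sorted({key[dim] for key in bottom_keys}):
            labels.append(value)
            rows.append([1 if key[dim] == value else 0 for key in bottom_keys])
    for i, key in enumerate(bottom_keys):
        labels.append("_".join(key))
        rows.append([1 if j == i else 0 for j in range(n_bottom)])
    return labels, rows


bottom = [
    ("north", "bikes"),
    ("north", "scooters"),
    ("south", "bikes"),
    ("south", "scooters"),
]
labels, s_mat = grouped_summing_matrix(bottom)
for label, row in zip(labels, s_mat):
    print(f"{label:16s} {row}")
```

Unlike a strict hierarchy, each bottom series here contributes to two different aggregates (its region and its product), which is why a single parent-to-children dictionary is awkward for this case.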

Additional context
Please let me know if you need any additional information. This seems like a very cool library, so I would be interested in contributing if there are small things to be done to enable this and other functionality. Any guidance is appreciated.

[BUG] Bug report: AttributeError: 'NoneType' object has no attribute 'fit'.

Bug description
While trying to reproduce the example given here: https://scikit-hts.readthedocs.io/en/latest/usage.html, I got the following error. Same error appears when running https://github.com/carlomazzaferro/scikit-hts-examples/blob/master/notebooks/visnights.ipynb too.

To Reproduce
Steps to reproduce the behavior:

from hts import HTSRegressor
from hts.utilities.load_data import load_hierarchical_sine_data
from datetime import datetime

# load some data
s, e = datetime(2019, 1, 15), datetime(2019, 10, 15)
df = load_hierarchical_sine_data(s, e).resample('1H').apply(sum)

hier = {
    "total": ["a", "b", "c"],
    "a": ["aa", "ab"],
    "aa": ["aaa", "aab"],
    "b": ["ba", "bb"],
    "c": ["ca", "cb", "cc", "cd"],
}

reg = HTSRegressor(model='prophet', revision_method='OLS')
reg = reg.fit(df=df, nodes=hier)

Throws the following error
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\rft\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "c:\users\joshipan\downloads\scikit-hts\hts\utilities\distribution.py", line 40, in _function_with_partly_reduce
return list(results)
File "c:\users\joshipan\downloads\scikit-hts\hts\utilities\distribution.py", line 39, in
results = (map_function(chunk, kwargs) for chunk in chunk_list)
File "c:\users\joshipan\downloads\scikit-hts\hts\core\utils.py", line 39, in _do_actual_fit
model_instance = instantiated_model.fit(**function_kwargs['fit_kwargs'])
File "c:\users\joshipan\downloads\scikit-hts\hts\model\p.py", line 87, in fit
self.model = self.model.fit(df)
AttributeError: 'NoneType' object has no attribute 'fit'
"""

The above exception was the direct cause of the following exception:

AttributeError Traceback (most recent call last)
in ()
16
17 reg = HTSRegressor(model='prophet', revision_method='OLS')
---> 18 reg = reg.fit(df=df, nodes=hier)

c:\users\joshipan\downloads\scikit-hts\hts\core\regressor.py in fit(self, df, nodes, tree, exogenous, root, distributor, disable_progressbar, show_warnings, **fit_kwargs)
211 n_jobs=self.n_jobs,
212 disable_progressbar=disable_progressbar, show_warnings=show_warnings,
--> 213 distributor=distributor
214 )
215

c:\users\joshipan\downloads\scikit-hts\hts\core\utils.py in _do_fit(nodes, function_kwargs, n_jobs, disable_progressbar, show_warnings, distributor)
25 result = distributor.map_reduce(_do_actual_fit,
26 data=nodes,
---> 27 function_kwargs=function_kwargs)
28 distributor.close()
29 return result

c:\users\joshipan\downloads\scikit-hts\hts\utilities\distribution.py in map_reduce(self, map_function, data, function_kwargs, chunk_size, data_length)
164 disable=self.disable_progressbar)
165
--> 166 result = list(itertools.chain.from_iterable(result))
167
168 return result

C:\ProgramData\Anaconda3\envs\rft\lib\site-packages\tqdm\std.py in __iter__(self)
1106 fp_write=getattr(self.fp, 'write', sys.stderr.write))
1107
-> 1108 for obj in iterable:
1109 yield obj
1110 # Update and possibly print the progressbar.

C:\ProgramData\Anaconda3\envs\rft\lib\multiprocessing\pool.py in next(self, timeout)
733 if success:
734 return value
--> 735 raise value
736
737 __next__ = next # XXX

AttributeError: 'NoneType' object has no attribute 'fit'

Desktop:

  • OS: [e.g. Windows 10, 64 bit]
  • scikit-hts version: 0.5.2
  • scikit-learn 0.23.2
  • Python version: 3.6.12

[FEATURE] Add support for Min-T reconciliation ("optimal reconciliation" using trace minimization and historical errors)

Is your feature request related to a problem? Please describe.
Not a problem exactly, but a missing piece of functionality that would make the package more useful (IMO) and more on par with the advanced hierarchical reconciliation capabilities available in R.

Describe the solution you'd like
I want to be able to do "optimal reconciliation" via trace minimization as in the R package hts (see https://www.rdocumentation.org/packages/hts/versions/6.0.1/topics/MinT, https://robjhyndman.com/papers/mint.pdf). This type of reconciliation allows the user to account for covariances of historical errors in reconciliation. Ideally, the solution would also allow for a nonnegative version (where reconciled forecasts are constrained to be nonnegative), as in https://robjhyndman.com/publications/nnmint.

Implementing this feature would entail extending the RevisionMethod class to include one more option for the reconciliation methods, as well as adding auxiliary code to support the Min-T approach in functions.py. Min-T reconciliation requires access to residuals (not just mse), so those will need to be passed to methods _revise() in HTSRegressor and revise() in RevisionMethod. (There would be a similar change for the convenience function revise_forecasts() as well.)

Describe alternatives you've considered
WLSV reconciliation is the closest to Min-T, but it doesn't account for covariances between nodes, only for variances of individual nodes in the hierarchy.
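For orientation, the difference between these methods can be stated with the general reconciliation formula from the MinT paper linked above. The notation is an assumption of this note rather than something used in the package: $S$ is the summing matrix, $\hat{\mathbf{y}}_h$ the vector of base forecasts for all nodes at horizon $h$, and $W_h$ the covariance matrix of the base forecast errors.

```latex
\tilde{\mathbf{y}}_h = S \left( S^\top W_h^{-1} S \right)^{-1} S^\top W_h^{-1} \, \hat{\mathbf{y}}_h
```

Taking $W_h$ proportional to the identity gives OLS, a diagonal $W_h$ holding only the per-node error variances gives WLSV, and an estimate of the full covariance matrix gives Min-T. The estimate needs the residuals themselves, which is why passing only the MSE is not enough.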

Additional context
I'm happy to work on developing this (I've already started, actually), but I thought I would post this first to generate more discussion and visibility! Please chime in if you have any thoughts related to Min-T reconciliation or extending the package in this way more generally.

[BUG] AutoArima fails on exogenous example

Describe the bug
Passing of exogenous variables to hts.model.AutoArimaModel is of incorrect shape and therefore fails.

To Reproduce
Create a Hierarchy tree using training set like seen in examples with exogenous variables:

exogenous = {k: ['SNP', 'HMI'] for k in df_train.columns if k not in ['SNP', 'HMI']}
ht = HierarchyTree.from_nodes(hier, df_train, root='United States', exogenous=exogenous)
hier_df = ht.to_pandas()

Fit using AutoArimaModel

clf = AutoArimaModel(ht, start_p=1, start_q=1, max_p=3, max_q=3, stepwise=True, trace=True, m=12, maxiter=30 , n_jobs=10)
model = clf.fit(df=hier_df, nodes=hier, root='United States')

Create a hierarchy tree using test set
exo = {k: ['SNP', 'HMI'] for k in df_test.columns if k not in ['SNP', 'HMI']}
ht_test = HierarchyTree.from_nodes(hier, df_test, root='United States', exogenous=exo)

Predict using test hierarchy tree
predicted_autoarima = model.predict(ht_test.get_node("TX"), steps_ahead=10)

Expected behavior
As seen in predict() in hts/model/ar.py, if exogenous is set to true, it sets exogenous to node.item. In this case, node.item contains the exogenous variables as well as the node's own series, which makes the size of the dataframe size(exogenous) + 1.
This is confirmed by the error message:
Provided exogenous values are not of the appropriate shape. Required (10, 2), got (10, 3).
The solution as I see it would be to set exogenous to the proper slice (node.item.iloc[:,1:]).

Desktop (please complete the following information):

  • OS: Ubuntu 20.04
  • scikit-hts version: 0.5.3
  • Python version: 3.8.3

Additional context
Reassigning the node as follows and passing it into predict is not a workaround:
test_node = ht_test.get_node("TX")
test_node.item = test_node.item.iloc[:,1:]

[ENHANCEMENT] Decrease memory footprint

Is your proposed enhancement related to a problem? Please describe.
Each of the nodes requires a model to be kept in memory after fitting. This is clearly a problem when working with a very large number of nodes. After improvements (12ccbbe) Prophet still takes ~0.5 MB of memory per model for a simple set of univariate series of ~1900 time steps.

Describe the solution you'd like
A solution could serialize models after fitting, keeping a reference to the path of the model only. On prediction, each of the models are loaded, inference is performed, and the models are then released from memory.

Describe alternatives you've considered
While serializing the models might be overkill, and require some tinkering around the implementation (where are the models going to be written to? Will there be any permissions issues? What library can be used for serializing?), I don't see any evident alternatives.

Proposed interface

from hts import HTSRegressor

ht = HTSRegressor()

# add the low_memory and model_path kwargs 
ht.fit(df, nodes, low_memory=True, model_path='~/.scikit-hts/tmp/')

Proposed implementation

  1. Check if directory exists and is writable by current user
  2. Create directory if needed
  3. After calling fit on the TimeSeriesModel chosen (https://github.com/carlomazzaferro/scikit-hts/blob/master/hts/core/regressor.py#L194), instead of setting the self.hts_result.models = (node.key, model_instance) to the actual model instance, we set it to the model's serialized path. The name must be intuitive but also unique, so likely using the node's name.
  4. Prediction works in the same way, node key is used to load the model and make the prediction
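The four steps above can be sketched with the standard library as follows. `ModelStore` is a hypothetical helper, and `pickle` is only one possible answer to the open question of which serializer to use; a plain dictionary stands in for a fitted model.

```python
import os
import pickle
import tempfile
from pathlib import Path


class ModelStore:
    """Keeps fitted models on disk and only their paths in memory."""

    def __init__(self, model_path):
        self.root = Path(model_path).expanduser()
        # Steps 1-2: create the directory if needed, then check it is writable.
        self.root.mkdir(parents=True, exist_ok=True)
        if not os.access(self.root, os.W_OK):
            raise PermissionError(f"{self.root} is not writable")
        self.paths = {}

    def save(self, node_key, model):
        # Step 3: store the serialized path under the node key, not the model.
        path = self.root / f"{node_key}.pkl"
        with path.open("wb") as fh:
            pickle.dump(model, fh)
        self.paths[node_key] = path
        return path

    def load(self, node_key):
        # Step 4: load on demand; the caller drops the model after predicting.
        with self.paths[node_key].open("rb") as fh:
            return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    store = ModelStore(tmp)
    store.save("total", {"trend": 0.5})  # stand-in for a fitted model
    restored = store.load("total")
    kept_in_memory = dict(store.paths)

print(restored, list(kept_in_memory))
```

Using the node key directly as a file name assumes keys contain no path separators or other characters that are invalid in file names; a real implementation would need to sanitize or hash them.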

[FEATURE] Adding datasets component with basic hierarchical data.

Is your feature request related to a problem? Please describe.
Add a datasets component for downloading some basic hierarchical datasets, such as the Australian visitor nights data used by Hyndman in his book.

Describe the solution you'd like
A simple section of datasets such as we see in pmdarima, scikit-learn, etc.

Example:

scikit-hts requires each column of the dataframe to be a node. However, most standard data comes in long format, so we would include the data in long format and transform it to demonstrate the entire process.

image

from hts.datasets import visnights

[FEATURE] Distributed training

Is your feature request related to a problem? Please describe.
As highlighted by #5, memory issues might arise when training on a large number of nodes. Providing the ability to run distributed training would greatly improve the usability of the library for real-world scenarios

Describe the solution you'd like
Ability to run distributed training, ideally with dask or any other distributed processing framework

Describe alternatives you've considered
Parallelizing training might speed it up, but it does not make the memory issues any easier.

Additional context
See #2 (comment)

[BUG] Boxcox transform fails for a Prophet model

Describe the bug
I am unable to obtain predictions from a FBProphetModel when transform=True.
The error message implies that it is unable to safely apply the inv_boxcox method to the values returned by the Prophet model.
I suspect this is because the model is returning negative values.

To Reproduce

from hts import HTSRegressor
from hts.utilities.load_data import load_mobility_data
df = load_mobility_data()
hier = {
        'total': ['CH', 'SLU', 'BT', 'OTHER'],
        'CH': ['CH-07', 'CH-02', 'CH-08', 'CH-05', 'CH-01'],
        'SLU': ['SLU-15', 'SLU-01', 'SLU-19', 'SLU-07', 'SLU-02'],
        'BT': ['BT-01', 'BT-03'],
        'OTHER': ['WF-01', 'CBD-13']
    }
reg = HTSRegressor(model='prophet', transform=True)
reg = reg.fit(df=df, nodes=hier)
reg.predict(num_steps=30)
Traceback:
...
TypeError: ufunc 'inv_boxcox' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Expected behavior
When calling the underlying Prophet model, the input data should be transformed to ensure that the predicted values are positive
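The suspicion about negative values can be made concrete with a plain-Python version of the Box-Cox transform and its inverse. This is a sketch of the textbook formulas, not the code path used by scikit-hts, and it does not establish that this is what triggers the TypeError above (that message refers to input types). It only shows that the inverse is undefined when lambda * y + 1 is not positive.

```python
import math


def boxcox(x, lam):
    """Box-Cox transform of a strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox is only defined for positive values")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam


def inv_boxcox(y, lam):
    """Inverse Box-Cox transform; fails when no positive preimage exists."""
    if lam == 0:
        return math.exp(y)
    base = lam * y + 1
    if base <= 0:
        raise ValueError(f"lam * y + 1 = {base} is not positive, so y has no inverse")
    return base ** (1 / lam)


lam = 0.5
transformed = boxcox(9.0, lam)
roundtrip = inv_boxcox(transformed, lam)
print(transformed, roundtrip)

# A forecast that drifts too far below zero in the transformed space
# cannot be mapped back.
try:
    inv_boxcox(-3.0, lam)
except ValueError as exc:
    print("cannot invert:", exc)
```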

Desktop (please complete the following information):

  • OS: Linux arch-5.5.7
  • scikit-hts version: 0.5.1
  • Python version: 3.7.6

Additional context
I've tried forcing prophet to fit to positive values using

reg = HTSRegressor(model='prophet', capacity_min=0, capacity_max=10000)
reg.fit(df, nodes=hier)
reg.predict(steps_ahead=30)

which fails because the underlying dataframes have no 'cap' value.

Cannot build hierarchy with large dataset

  • scikit-hts version: 0.5.1
  • Python version: 3.7
  • Operating System: Windows 10

Description

The Jupyter notebook kernel dies when trying to build a HierarchyTree with a very large dataset.

What I Did

My dataframe has 64 rows x 205664 columns. After building the nodes, I proceed to build the tree (I want to check it before fitting). I use Jupyter notebook. It runs for some time and then the kernel dies. When I restart and run it again, the kernel dies every time it reaches this part.
Maybe my tree is really deep and the leaves are really wide.

To reproduce
Build a hierarchy tree with a really large dataset.

Expected behavior
Tree is successfully built.


[BUG] Getting Error Message when Using the Package to Predict

Describe the bug
The following error message comes from the predict function:
Screen Shot 2020-08-29 at 1 31 17 PM

To Reproduce
Input:
Screen Shot 2020-08-29 at 1 32 51 PM

The index is of string type, and the dataframe has one column per store plus a total column.
I tried converting the index to datetime, but then got a different error about an incompatibility between timestamp and object.

Code:
clf = HTSRegressor(model='prophet', revision_method='OLS', n_jobs=12)
model = clf.fit(hts_ssm, hier)
preds = model.predict(steps_ahead=2,freq='M')

also tried freq='D' but getting the same error.

Expected behavior
Expected not getting an error.

Desktop (please complete the following information):

  • OS: OS X Catalina
  • scikit-hts version: 0.2.1
  • Python version: 3.7.4


[FEATURE] Add option for method = "NONE"

Is your feature request related to a problem? Please describe.
There is currently no way to compare the revised predictions against unrevised predictions.

Describe the solution you'd like
An option to specify that the HTSRegressor applies no revision method and simply provides the predictions made by the individual models at each level.

Describe alternatives you've considered

Additional context
I've implemented the feature very simply by adding the option in RevisionMethod.revise to return y_hat_matrix(forecasts = forecasts) when the method is NONE. Once I figure out how to correctly run the tests, I can submit a pull request if you think this is a worthwhile feature.

[ENHANCEMENT] Add support for other python versions / platforms

Is your proposed enhancement related to a problem? Please describe.
Add support and automated tests for python 3.5+, and for MacOS, Windows

Describe the solution you'd like
Add test matrix with tox

Describe alternatives you've considered
Tox, or any other solution that would run tests on Travis CI for different versions/platforms

[BUG] Probably a bug in how reconciliation works now

Describe the bug
It seems that currently there is either a bug in how reconciliation works or probably I just don't quite understand how it works :)

To Reproduce

        n_cols = len(list(forecasts.keys())) + 1
        if self.name == "BU":
            keys = list(forecasts.keys())[
                n_cols - self.sum_mat.shape[1] - 1 : n_cols - 1
            ]
        else:
            keys = range(n_cols - self.sum_mat.shape[1] - 1, n_cols - 1)
        return y_hat_matrix(forecasts, keys=keys)
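To make the slicing in the snippet above easier to follow, the sketch below isolates just the index arithmetic. The function `select_keys` and the example key orders are hypothetical; `n_bottom` plays the role of `self.sum_mat.shape[1]`. It shows that the BU branch simply takes the last `n_bottom` keys, so the result depends on the order of the forecasts dictionary. Whether that ordering is what causes the zeros reported here is not established.

```python
def select_keys(forecast_keys, n_bottom, method):
    """Reproduce only the key selection from the snippet above."""
    n_cols = len(forecast_keys) + 1
    start, stop = n_cols - n_bottom - 1, n_cols - 1
    if method == "BU":
        # Takes the last n_bottom keys, whatever nodes they happen to be.
        return forecast_keys[start:stop]
    # Other methods get integer positions instead of node names.
    return list(range(start, stop))


# Hypothetical hierarchy: total -> a, b ; a -> aa, ab (bottom level: aa, ab, b)
ordered = ["total", "a", "b", "aa", "ab"]
shuffled = ["total", "b", "a", "aa", "ab"]

print(select_keys(ordered, n_bottom=3, method="BU"))
print(select_keys(shuffled, n_bottom=3, method="BU"))
print(select_keys(ordered, n_bottom=3, method="OLS"))
```

With the first ordering the slice happens to contain exactly the bottom-level nodes; with the second it picks up the aggregate `a` and drops the leaf `b`.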

Expected behavior
Surprisingly, in particular for the BU method some columns get removed from the dataframe and then the predicted dataframe has zeros instead of the forecasts.

from hts import HTSRegressor
from hts.hierarchy import HierarchyTree
from hts.utilities.load_data import load_hierarchical_sine_data, load_mobility_data
data = load_mobility_data()
hier = {
    "total": ["CH", "SLU", "BT", "OTHER"],
    "CH": ["CH-07", "CH-02", "CH-08", "CH-05", "CH-01"],
    "SLU": ["SLU-15", "SLU-01", "SLU-19", "SLU-07", "SLU-02"],
    "BT": ["BT-01", "BT-03"],
    "OTHER": ["WF-01", "CBD-13"],
}
train = data[:500]
test = data[500:507]
clf = HTSRegressor(model="prophet", n_jobs=10)
exogenous = {
    k: ["precipitation", "temp"]
    for k in data.columns
    if k not in ["precipitation", "temp"]
}
model = clf.fit(train.drop(columns=["precipitation", "temp"]), hier)
preds = model.predict(steps_ahead=7)

Here is how you can reproduce the issue.

Desktop (please complete the following information):

  • OS: Linux (Manjaro)
  • scikit-hts version: 0.5.11
  • Python version: 3.7.9

Additional context
It might be as well that I am missing something in the way it's supposed to work. I just noticed in the above example that the returned dataframe doesn't have correct predictions (for example, the column for CH has zeros everywhere). The issue arises prior to reconciliation: checking self.hts_result.forecasts confirms that the raw predictions are okay.

@carlomazzaferro , can you please take a look?

unsupported operand type(s) for +: 'Timestamp' and 'NoneType' when using model.predict

Describe the bug
After fitting the model, calling model.predict(steps_ahead=n) runs to 100% but then throws the following error: "unsupported operand type(s) for +: 'Timestamp' and 'NoneType'"

To Reproduce
I have a dataframe set up very similarly to the one in the M5 notebook. I can't see how it would differ apart from the numbers. The hierarchy is set up correctly.

Expected behavior
model.predict() should finish without errors.

Desktop (please complete the following information):

  • OS: OS X Catalina
  • scikit-hts version: 0.5.1
  • Python version: 3.7.6

Additional context
I don't know exactly how predict works, but it seems to want to use the (datetime) index to predict the next step?

[BUG] Prophet model doesn't use upper/lower constraints

It seems best to have a separate issue to address this problem. The underlying prophet model does not apply the min and max capacity constraints. The following reproduces the problem regardless of whether the underlying data is transformed.

reg = HTSRegressor(model='prophet', capacity_min=0, capacity_max=10000)
reg.fit(df, nodes=hier)
reg.predict(steps_ahead=30)

I am not sure whether the internal generation of the future object could be modified, or whether it could be passed in by the user.

[FEATURE] ability to specify revision method during predict

Is your feature request related to a problem? Please describe.
If I want to create predictions with multiple different revision methods, I need to repeatedly start over, define a new regressor, and retrain all of the models.

Describe the solution you'd like
The revision method only comes into play during the prediction stage and does not alter the training step whatsoever. Therefore, I think it should be possible to pass the revision method name to the predict function, which would supersede the method specified on the HTSRegressor object.
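The proposed interface might behave like the toy class below. `ToyRegressor` is a hypothetical stand-in and not the HTSRegressor implementation: fitting is reduced to computing a mean, and no revision is actually applied. It only shows a `revision_method` argument on predict overriding the constructor default without triggering a refit.

```python
class ToyRegressor:
    """Hypothetical stand-in used to illustrate the predict-time override."""

    def __init__(self, revision_method="OLS"):
        self.revision_method = revision_method
        self.fit_calls = 0

    def fit(self, series):
        # Stand-in for the expensive per-node training step.
        self.fit_calls += 1
        self.base_forecast_ = sum(series) / len(series)
        return self

    def predict(self, steps_ahead, revision_method=None):
        # The argument, when given, takes precedence over the constructor value.
        method = revision_method or self.revision_method
        return {"method": method, "forecast": [self.base_forecast_] * steps_ahead}


reg = ToyRegressor().fit([1.0, 2.0, 3.0])
results = {m: reg.predict(2, revision_method=m) for m in ("OLS", "BU", "FP")}
print(reg.fit_calls, sorted(results))
```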

Describe alternatives you've considered
You can set model.revision_method.name = 'OLS' before calling model.predict( ) to change what revision method is used. I don't think this is the best way to do it.

Additional context
Let me know your thoughts. I can implement this ASAP. Should a method be specified when instantiating the regressor object at all?
