angus924 / minirocket

MINIROCKET: A Very Fast (Almost) Deterministic Transform for Time Series Classification

License: GNU General Public License v3.0

Python 100.00%

scalable time-series-classification convolution convolutional-kernel convolutional-neural-network

minirocket's Introduction

ROCKET · MINIROCKET · HYDRA

MINIROCKET

KDD 2021 / arXiv:2012.08791 (preprint)

Until recently, the most accurate methods for time series classification were limited by high computational complexity. ROCKET achieves state-of-the-art accuracy with a fraction of the computational expense of most existing methods by transforming input time series using random convolutional kernels, and using the transformed features to train a linear classifier. We reformulate ROCKET into a new method, MINIROCKET, making it up to 75 times faster on larger datasets, and making it almost deterministic (and optionally, with additional computational expense, fully deterministic), while maintaining essentially the same accuracy. Using this method, it is possible to train and test a classifier on all of 109 datasets from the UCR archive to state-of-the-art accuracy in less than 10 minutes. MINIROCKET is significantly faster than any other method of comparable accuracy (including ROCKET), and significantly more accurate than any other method of even roughly-similar computational expense. As such, we suggest that MINIROCKET should now be considered and used as the default variant of ROCKET.

Please cite as:

@inproceedings{dempster_etal_2021,
  author    = {Dempster, Angus and Schmidt, Daniel F and Webb, Geoffrey I},
  title     = {{MiniRocket}: A Very Fast (Almost) Deterministic Transform for Time Series Classification},
  booktitle = {Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  publisher = {ACM},
  address   = {New York},
  year      = {2021},
  pages     = {248--257}
}

Podcast

Hear more about MINIROCKET and time series classification on the Data Skeptic podcast!

GPU Implementation NEW

A GPU implementation of MINIROCKET, developed by Malcolm McLean and Ignacio Oguiza, is available through tsai. See the examples. Many thanks to Malcolm and Ignacio for their work in developing the GPU implementation and making it part of tsai.

`sktime`* / Multivariate

MINIROCKET (including a basic multivariate implementation) is also available through sktime. See the examples.

* for larger datasets (10,000+ training examples), the sktime methods should be integrated with SGD or similar as per softmax.py (replace calls to fit(...) and transform(...) from minirocket.py with calls to the relevant sktime methods as appropriate)

Results

UCR Archive (109 Datasets, 30 Resamples)
- Mean Accuracy + Training/Test Times
- Accuracy Per Resample
Scalability / Training Set Size*
- MosquitoSound (139,780 × 3,750)
- InsectSound (25,000 × 600)
- FruitFlies (17,259 × 5,000)
Scalability / Time Series Length
- DucksAndGeese (50 × 236,784)

* num_training_examples does not include the validation set of 2,048 training examples, but the transform time for the validation set is included in time_training_seconds

Requirements*

Python, NumPy, pandas
Numba (0.50+)
scikit-learn or similar
PyTorch or similar (for larger datasets)

* all pre-packaged with or otherwise available through Anaconda

Code

`minirocket.py`

`minirocket_dv.py` (MINIROCKET_DV)

`softmax.py` (PyTorch / 10,000+ Training Examples)

`minirocket_multivariate.py` (equivalent to sktime/MiniRocketMultivariate)

`minirocket_variable.py` (variable-length input; experimental)

Important Notes

Compilation

The functions in minirocket.py and minirocket_dv.py are compiled by Numba on import, which may take some time. By default, the compiled functions are now cached, so this should only happen once (i.e., on the first import).

Input Data Type

Input data should be of type np.float32. Alternatively, you can change the Numba signatures to accept, e.g., np.float64.
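
As a minimal sketch of the conversion described above, assuming the data is already in a NumPy array (`X_raw` is a hypothetical stand-in for your own data, not a name from this repository):

```python
import numpy as np

# Hypothetical stand-in for user data; NumPy creates float64 arrays by default.
X_raw = np.random.default_rng(0).normal(size=(5, 100))

# Convert to a C-contiguous float32 array before calling fit(...) / transform(...).
X = np.ascontiguousarray(X_raw, dtype=np.float32)

print(X_raw.dtype, "->", X.dtype)
```

The conversion copies the data, so for very large arrays it may be preferable to load the data as np.float32 in the first place.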

Normalisation

Unlike ROCKET, MINIROCKET does not require the input time series to be normalised. (However, whether or not it makes sense to normalise the input time series may depend on your particular application.)

Examples

MINIROCKET

import numpy as np

from minirocket import fit, transform
from sklearn.linear_model import RidgeClassifierCV

[...] # load data, etc.

# note:
# * input time series do *not* need to be normalised
# * input data should be np.float32

parameters = fit(X_training)

X_training_transform = transform(X_training, parameters)

# note: the normalize argument was removed in scikit-learn 1.2; omit it on newer versions
classifier = RidgeClassifierCV(alphas = np.logspace(-3, 3, 10), normalize = True)
classifier.fit(X_training_transform, Y_training)

X_test_transform = transform(X_test, parameters)

predictions = classifier.predict(X_test_transform)

MINIROCKET_DV

import numpy as np

from minirocket_dv import fit_transform
from minirocket import transform
from sklearn.linear_model import RidgeClassifierCV

[...] # load data, etc.

# note:
# * input time series do *not* need to be normalised
# * input data should be np.float32

parameters, X_training_transform = fit_transform(X_training)

# note: the normalize argument was removed in scikit-learn 1.2; omit it on newer versions
classifier = RidgeClassifierCV(alphas = np.logspace(-3, 3, 10), normalize = True)
classifier.fit(X_training_transform, Y_training)

X_test_transform = transform(X_test, parameters)

predictions = classifier.predict(X_test_transform)

PyTorch / 10,000+ Training Examples

from softmax import train, predict

model_etc = train("InsectSound_TRAIN_shuffled.csv", num_classes = 10, training_size = 22952)
# note: 22,952 = 25,000 - 2,048 (validation)

predictions, accuracy = predict("InsectSound_TEST.csv", *model_etc)
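
The `training_size` value in the example follows from the dataset size listed under Results and the validation set size mentioned in the note there. A small sketch of the arithmetic (the variable names are illustrative, not taken from softmax.py):

```python
# Number of examples listed for InsectSound under "Results".
num_examples = 25_000

# Size of the validation set, per the note under "Results".
validation_size = 2_048

# Value passed as training_size in the example above.
training_size = num_examples - validation_size

print(training_size)  # 22952
```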

Variable-Length Input (Experimental)

import numpy as np

from minirocket_variable import fit, transform, filter_by_length
from sklearn.linear_model import RidgeClassifierCV

[...] # load data, etc.

# note:
# * input time series do *not* need to be normalised
# * input data should be np.float32

# special instructions for variable-length input:
# * concatenate variable-length input time series into a single 1d numpy array
# * provide another 1d array with the lengths of each of the input time series
# * input data should be np.float32 (as above); lengths should be np.int32

# optionally, use a different reference length when setting dilation (default is
# the length of the longest time series), and use fit(...) with time series of
# at least this length, e.g.:
# >>> reference_length = X_training_lengths.mean()
# >>> X_training_1d_filtered, X_training_lengths_filtered = \
# >>> filter_by_length(X_training_1d, X_training_lengths, reference_length)
# >>> parameters = fit(X_training_1d_filtered, X_training_lengths_filtered, reference_length)

parameters = fit(X_training_1d, X_training_lengths)

X_training_transform = transform(X_training_1d, X_training_lengths, parameters)

# note: the normalize argument was removed in scikit-learn 1.2; omit it on newer versions
classifier = RidgeClassifierCV(alphas = np.logspace(-3, 3, 10), normalize = True)
classifier.fit(X_training_transform, Y_training)

X_test_transform = transform(X_test_1d, X_test_lengths, parameters)

predictions = classifier.predict(X_test_transform)
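
The special instructions for variable-length input can be sketched as follows. This is only an illustration of the data layout described in the comments above; the series are hypothetical and deliberately very short for readability.

```python
import numpy as np

# Hypothetical variable-length series (made-up values, shortened for readability).
series = [
    np.array([1.0, 2.0, 3.0]),
    np.array([4.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 9.0]),
]

# Concatenate all series into a single 1d float32 array ...
X_training_1d = np.concatenate(series).astype(np.float32)

# ... and record the length of each series in a separate 1d int32 array.
X_training_lengths = np.array([len(s) for s in series], dtype=np.int32)

# The start offset of each series can be recovered from the lengths.
offsets = np.concatenate(([0], np.cumsum(X_training_lengths)[:-1]))
second_series = X_training_1d[offsets[1]:offsets[1] + X_training_lengths[1]]

print(X_training_1d.shape, X_training_lengths, second_series)
```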

Acknowledgements

We thank Professor Eamonn Keogh and all the people who have contributed to the UCR time series classification archive. Figures in our paper showing mean ranks were produced using code from Ismail Fawaz et al. (2019).

🚀🚀🚀

minirocket's Issues

Does the code handle padding correctly?

For the input X, can users add padding themselves?

TypeError: No matching definition for argument type(s) array(float64, 2d, C), array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C)

Thank you very much, once again, for this great piece of software. Very much appreciated! I'm trying to use it with my data but unfortunately, I always get the following error if I attempt to fit my input with "parameters = fit(x_trainScaled)":

TypeError: No matching definition for argument type(s) array(float64, 2d, C), array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C)

Here are some, probably, relevant characteristics of my input:

    print(x_trainScaled.shape)
    print(x_trainScaled.dtype)

returns:

(3000, 3000)
float64

// edit:

This is the whole traceback:

  File "minirocket\code\minirocket.py", line 130, in fit
    biases = _fit_biases(X, dilations, num_features_per_dilation, quantiles)
  File "\lib\site-packages\numba\dispatcher.py", line 500, in _explain_matching_error
    raise TypeError(msg)

Unlabeled data

Hello, and thanks for your excellent work.
I have a question. In your response to the issue 'starting with "wide" data', you say that the data can be unlabelled, depending on the task: "(You don't need labels necessarily, depending on your task.)"
When I read your paper and the README, I noticed you mention that the parameters are the same across different datasets. Is that right? (I am not sure I have understood this correctly, and I can't find where I read it.)
So my question is: can I apply your work to my unlabelled data? If so, how should I set "Y_training" in the example code?
Thanks!

Are minirocket.py and minirocket_dv.py the implementation for the same Mini-Rocket algorithm?

Accuracy problem

I am trying to run MiniRocket. The accuracy I get is 85%, which is 2% different from your reported accuracy. I have used the code you provided and restricted it to the 109 UCR datasets. You can see the relevant code in the link below. What is the reason for this 2% difference?
https://colab.research.google.com/drive/1YcrWTSF7oNqGeP-C0n-pdi2EzAqYo-_g?usp=sharing

Feature Transformation

Hello,

I am trying to run MiniRocket on my dataset, which is a SCADA dataset containing data from multiple sensors over a period of time. It is a multivariate time series, so I am using the multivariate version of MiniRocket (MiniRocketMultivariate) from sktime. However, the features are not being transformed the way they are supposed to be.

Initially, I ran the following chunk of code on my personal SCADA dataset:

minirocket_multi = MiniRocketMultivariate()
X_train_transform = minirocket_multi.fit_transform(X_train)
X_test_transform = minirocket_multi.transform(X_test)

This is the output that I am getting,

----------------------Before Transformation------------------------------
X_train: (34992, 25)
X_test: (17472, 25)
----------------------After Transformation------------------------------
X_train: (1, 9996)
X_test: (1, 9996)

However, I think that after transformation the shapes of X_train and X_test should be (34992, 9996) and (17472, 9996). Could you please help me with this? Why is it transforming just one single sample, and not the rest?

I should also mention that I load the data from a pickle file, which contains the data as a pandas DataFrame.

import pickle

with open(train_file, "rb") as f:
    data_train = pickle.load(f)

X_train_wt = data_train.iloc[:, :-1]  # all columns except the last
y_train_wt = data_train.iloc[:, -1]   # last column
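
One possible reading of the shapes above is that the 2d table (time steps × sensors) was treated as a single multivariate series rather than as 34,992 separate instances. If each instance should instead be a window of consecutive rows, a sketch like the following builds a 3d array of shape (num_instances, num_channels, window_length). The helper `make_windows`, the window length and the step are all assumptions for illustration; they do not come from this repository or from sktime.

```python
import numpy as np

def make_windows(table, window_length, step):
    """Cut a (num_timesteps, num_channels) table into overlapping windows.

    Returns an array of shape (num_windows, num_channels, window_length).
    """
    starts = range(0, table.shape[0] - window_length + 1, step)
    windows = [table[s:s + window_length].T for s in starts]
    return np.stack(windows).astype(np.float32)

# Hypothetical table: 100 time steps, 3 sensors.
table = np.arange(100 * 3, dtype=np.float64).reshape(100, 3)

X = make_windows(table, window_length=20, step=10)
print(X.shape)  # (9, 3, 20)
```

Labels would need to be assigned per window (for example, the label of the last row in each window); that choice depends on the task and is not shown here.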

What if the number of kernels is much smaller than the number of samples in the dataset?

If the number of kernels is much smaller than the number of samples in the dataset, will it limit the model's performance?

padding problem

For non-stationary queues, it might seem a bit odd if padding is done solely with zeros. Would it be better to fill in the values of the start point and end point instead?
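
To make the two options in this question concrete, the sketch below compares zero padding with edge-value padding using NumPy's `np.pad`. It only illustrates the difference on a toy series; it is not a change to the padding used in minirocket.py.

```python
import numpy as np

# Toy series whose values are far from zero.
x = np.array([5.0, 6.0, 7.0, 8.0], dtype=np.float32)
pad = 2

# Zero padding: the padded region jumps to 0 at both ends.
zero_padded = np.pad(x, pad, mode="constant", constant_values=0.0)

# Edge padding: the first and last values are repeated instead.
edge_padded = np.pad(x, pad, mode="edge")

print(zero_padded)  # [0. 0. 5. 6. 7. 8. 0. 0.]
print(edge_padded)  # [5. 5. 5. 6. 7. 8. 8. 8.]
```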

minirocket_multivariate extremely slow

My setup is that I am using large dataset (10,000+) and I pass data as batches into model. I do not cache the data and run transform every time I pass data into model on every epoch. I run this same setup for both

minirocket.py with input shape (32768,99) and

minirocket_multivariate.py with input shape (32768,1,99) so the number of channel is 1.

I find that the minirocket_multivariate.py version runs significantly more slowly on every transform() call than minirocket.py.

Is there a potential bug in the code?

Can't set random_state when doing a gridsearchCV

Dependencies

import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from sktime.datasets import load_basic_motions
from sktime.transformations.panel.rocket import MiniRocketMultivariate

Make train/test split and set up pipeline

X_train, y_train = load_basic_motions(split="train", return_X_y=True)

model = Pipeline([
    ('minirocket', MiniRocketMultivariate(random_state=42)), 
    ('ridge_clf', RidgeClassifier(random_state=42)),
])

Fit 1 model

model.fit(X_train, y_train)
Works fine

Now do a gridsearch for alpha value

parameters = {
  'ridge_clf__alpha': [0.1, 1, 10],
}

model_cv = GridSearchCV(model, parameters)

model_cv.fit(X_train, y_train)

"RuntimeError: Cannot clone object MiniRocketMultivariate(random_state=42), as the constructor either does not set or modifies parameter random_state"

Port to R language

Hi,

I was wondering if you plan to port minirocket to R language.

Many thanks,

starting with "wide" data

If I start with the wide data format, a 2d array of samples (rows) by sensor readings (columns), what is the right way to transform that to fit the requirements of this library?
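
As a rough sketch, assuming each row is one time series and each column one time step: the shapes used here follow another issue on this page, which passes (32768, 99) to minirocket.py and (32768, 1, 99) to minirocket_multivariate.py. The array `X_wide` is a hypothetical stand-in.

```python
import numpy as np

# Hypothetical wide data: 6 samples (rows) by 50 readings (columns).
X_wide = np.random.default_rng(0).normal(size=(6, 50))

# Univariate layout: (num_examples, series_length), as float32.
X_univariate = X_wide.astype(np.float32)

# Multivariate layout with a single channel: (num_examples, 1, series_length).
X_multivariate = X_univariate[:, np.newaxis, :]

print(X_univariate.shape, X_multivariate.shape)  # (6, 50) (6, 1, 50)
```

If the columns are readings from different sensors at one point in time rather than consecutive time steps, this layout does not apply and the data would first need to be arranged into series.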

Can't reproduce the accuracy of 'EOGHorizontalSignal' dataset in UCR109 based on original 'MINIROCKET' examples

Hi, as the title suggests. The code is as follows:

from minirocket import fit, transform
import pandas as pd
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

csv_root = "/media/data1/ubuntu_env/data/TSC_datasets/Univariate_ts2csv/"
subdataset_name = "EOGHorizontalSignal"
train_csv_file_path = csv_root + subdataset_name + '/{}_TRAIN.csv'.format(subdataset_name)
test_csv_file_path = csv_root + subdataset_name + '/{}_TEST.csv'.format(subdataset_name)

train_csv_file = pd.read_csv(train_csv_file_path,
                             header = None,
                             sep = ",",
                             skiprows = 0,
                             engine = "c")
test_csv_file = pd.read_csv(test_csv_file_path,
                            header = None,
                            sep = ",",
                            skiprows = 0,
                            engine = "c")

# first column is the class label; remaining columns are the time series
total_train_data = train_csv_file.values[:].copy()
X_training, Y_training = total_train_data[:, 1:].astype(np.float32), total_train_data[:, 0].astype(np.int32)
total_test_data = test_csv_file.values[:].copy()
X_test, Y_test = total_test_data[:, 1:].astype(np.float32), total_test_data[:, 0].astype(np.int32)

# note:
# * input time series do not need to be normalised
# * input data should be np.float32

parameters = fit(X_training)
X_training_transform = transform(X_training, parameters)
X_test_transform = transform(X_test, parameters)

classifier = RidgeClassifierCV(alphas = np.logspace(-3, 3, 10))
classifier.fit(X_training_transform, Y_training)
predictions = classifier.predict(X_test_transform)
total = len(Y_test)
correct = (predictions == Y_test).sum()
accuracy = correct / total
print("accuracy: {}".format(accuracy))

Based on the code above, the final accuracy is about 0.59~0.60, but the reported result is 0.83:

I have validated the data loading on several other datasets, where the accuracy appeared to be normal, so I think the data loading in my code should be OK. There is a big difference between 0.58 and 0.80, and I really wonder what causes it. Thanks!

How does the code in this repo relate to sktime.transformers.series_as_features.rocket?

Please help me understand how the code in this repo relates to sktime.transformers.series_as_features.rocket, as used in
https://towardsdatascience.com/minirocket-fast-er-and-accurate-time-series-classification-cdacca2dcbfa
from sktime.transformers.series_as_features.rocket import MiniRocket

Is it the same code?

how to use minirocket in production

Hi,
I would like to ask how to use minirocket in production (the implementation phase). Is there any way to save a minirocket transform that was fitted on the training data and use it on a new dataset?
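
One way to approach this, sketched under assumptions: a snippet from softmax.py quoted elsewhere on this page indicates that fit(...) returns dilations, num_features_per_dilation and biases, so the fitted parameters could be stored as NumPy arrays and passed to transform(...) later. The arrays below are made-up stand-ins for a real `parameters` tuple, and the file name is hypothetical.

```python
import os
import tempfile

import numpy as np

# Stand-in for `parameters = fit(X_training)`; the values are made up.
parameters = (
    np.array([1, 2, 4], dtype=np.int32),        # dilations
    np.array([28, 28, 28], dtype=np.int32),     # num_features_per_dilation
    np.linspace(-1.0, 1.0, 7, dtype=np.float32),  # biases
)

# Save the three arrays into a single .npz file.
path = os.path.join(tempfile.mkdtemp(), "minirocket_parameters.npz")
np.savez(
    path,
    dilations=parameters[0],
    num_features_per_dilation=parameters[1],
    biases=parameters[2],
)

# Later (e.g. in production): load the arrays and rebuild the tuple.
with np.load(path) as archive:
    restored = (
        archive["dilations"],
        archive["num_features_per_dilation"],
        archive["biases"],
    )

print([a.dtype for a in restored])
```

The trained classifier would need to be saved separately (for example with joblib or pickle), since the parameters cover only the transform.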

thank you

Need example to use variable ROCKET

Hi, would you mind providing examples to use minirocket_variable and minirocket_multivariate_variable? I am not sure on how to configure the required data input.

thank you

datatype

When I use my data with minirocket in PyCharm, I get a problem with the data type:

Traceback (most recent call last):
  File "E:/PycharmProjects/minirocket-main/code/traintest.py", line 46, in
    parameters = fit(X_training)
  File "E:\PycharmProjects\minirocket-main\code\minirocket.py", line 130, in fit
    biases = _fit_biases(X, dilations, num_features_per_dilation, quantiles)
  File "E:\ProgramData\Anaconda3\envs\deepl\lib\site-packages\numba\core\dispatcher.py", line 703, in _explain_matching_error
    raise TypeError(msg)
TypeError: No matching definition for argument type(s) pyobject, array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C)

How can I make it work?

Minimum length time series

Hi, what is the minimum length of a time series for MiniRocket? I have tried with time series of length 4, but it throws an error.
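
Some context for this question: MINIROCKET kernels have length 9 (per the paper), so even an undilated kernel spans 9 samples, which is more than a series of length 4 provides. The sketch below only works through that arithmetic; the helper `receptive_field` is hypothetical, and the exact minimum enforced by the code is not stated on this page.

```python
KERNEL_LENGTH = 9  # length of MINIROCKET kernels, per the paper

def receptive_field(dilation, kernel_length=KERNEL_LENGTH):
    """Number of input samples spanned by one dilated kernel."""
    return (kernel_length - 1) * dilation + 1

for dilation in (1, 2, 4):
    print(dilation, receptive_field(dilation))
# 1 9
# 2 17
# 4 33
```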

Is it necessary to normalise data when using the multivariate version of minirocket?

Channels get lost

Dear all,

I am applying Minirocket to a set of multivariate series with 7 channels and 8020 data points.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from minirocket_multivariate import fit, transform
import numpy as np

# Assumes df_final_extendedVelocidadAnglarDerivadaRadio is already prepared with your data
X = df_final_extendedVelocidadAnglarDerivadaRadio.drop(columns=['etiqueta']).values
y = df_final_extendedVelocidadAnglarDerivadaRadio['etiqueta'].values

# Reshape X into the shape expected by MiniRocket
X_reshaped = X.reshape(-1, 7, 8020)
X_reshaped = X_reshaped.astype(np.float32)

# Fit the MiniRocket parameters
params = fit(X_reshaped, num_features=100, max_dilations_per_kernel=84)

# Transform the data using MiniRocket
X_transformed = transform(X_reshaped, params)

When I print X_reshaped.shape I get: (240, 7, 8020)

However the transformation using minirocket_multivariate.fit() returns a X_transformed with dimensions (240, 84). I would have expected (240, 7, 84). Is my assumption correct? If so, am I doing anything wrong? Your help will be highly appreciated.

Best,
Luis

Extending Documentation of minirocket multivariate

Hello,

The implementations of multivariate minirocket (both here and in sktime) mention that it is a naive extension of the univariate version, but do not give any clearer explanation of what is actually happening under the hood. Looking directly at the source code for this version does not help much either, as it is fairly hard to read.

Could you extend the documentation on the repository with a (coarse) description of how the algorithm was extended to handle multivariate data and/or add some comments to the source code in that regard?

Thanks!

Feature Size

Thank you so much for making your work available! I have a quick question about the feature size. It looks like the minimum number of features is 84. Is there any harm in extracting 84 features and using only a subset of them?
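
For context, MINIROCKET uses a fixed set of 84 kernels, and the feature counts reported elsewhere on this page (9,996 columns when 10,000 features are requested, 84 columns when 100 are requested) are consistent with rounding down to a multiple of 84. The sketch below works through that arithmetic; `effective_num_features` is a hypothetical helper, and the floor at 84 is an assumption based on the observation in this question, not a statement about the code.

```python
NUM_KERNELS = 84  # fixed number of kernels in MINIROCKET

def effective_num_features(requested):
    """Largest multiple of 84 not exceeding `requested`, with a floor of 84 (assumed)."""
    return max(NUM_KERNELS, (requested // NUM_KERNELS) * NUM_KERNELS)

print(effective_num_features(10_000))  # 9996
print(effective_num_features(100))     # 84
```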

X_validation not transformed properly?

Hi, in softmax.py, if the data is split into multiple chunks, X_validation is only transformed with the first chunk's biases: the biases for different chunks are different, but the transform is only applied once.

if epoch == 0 and chunk_index == 0: # only run once <---

   parameters = fit(X_training, args["num_features"]) # returns: dilations, num_features_per_dilation, biases

   # transform validation data
   X_validation_transform = transform(X_validation, parameters)

would transforming the X_validation with each chunk's biases improve performance?

EDIT:

similarly for the latter part (where X_validation_transform is only normalised with mean and std values from the first chunk):

if epoch == 0 and chunk_index == 0:

                    # per-feature mean and standard deviation
                    f_mean = X_training_transform.mean(0)
                    f_std = X_training_transform.std(0) + 1e-8

                    # normalise validation features
                    X_validation_transform = (X_validation_transform - f_mean) / f_std
                    X_validation_transform = torch.FloatTensor(X_validation_transform)
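
To make the normalisation step in the quoted snippet concrete, the sketch below computes per-feature statistics on one set of training features and reuses them for the validation features. The arrays are random stand-ins, and this only illustrates what the snippet does; it does not settle whether per-chunk statistics would perform better.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for transformed features (values are not meaningful).
X_training_transform = rng.random((200, 10)).astype(np.float32)
X_validation_transform = rng.random((50, 10)).astype(np.float32)

# Per-feature mean and standard deviation, computed on training features only.
f_mean = X_training_transform.mean(0)
f_std = X_training_transform.std(0) + 1e-8

# The same statistics are applied to both training and validation features.
X_training_norm = (X_training_transform - f_mean) / f_std
X_validation_norm = (X_validation_transform - f_mean) / f_std

print(X_training_norm.mean(0).round(3))
print(X_validation_norm.mean(0).round(3))
```

The training features end up with approximately zero mean and unit standard deviation per feature, while the validation features are only approximately standardised, because they are scaled with statistics they did not contribute to.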

is there any Tensorflow implementation of minirocket?

It would be handy to have minirocket as a Tensorflow layer (and potentially have it use the GPU if that's better in that scenario).

Datatype Error

When I tried to run MiniRocket on my data in Python, I got the following datatype error: "TypeError: X must be in an sktime compatible format, of scitype Series, Panel or Hierarchical, for instance a pandas.DataFrame with sktime compatible time indices, or with MultiIndex and last(-1) level an sktime compatible time index. Allowed compatible mtype format specifications are: ['pd.Series', 'pd.DataFrame', 'np.ndarray', 'nested_univ', 'numpy3D', 'pd-multiindex', 'df-list', 'pd_multiindex_hier'] See the data format tutorial examples/AA_datatypes_and_datasets.ipynb, If you think the data is already in an sktime supported input format, run sktime.datatypes.check_raise(data, mtype) to diagnose the error, where mtype is the string of the type specification you want."

Any help for minirocket on UEA multivariate time series classification

Hello, have any results been reported for minirocket on the UEA multivariate time series classification archive? @angus924
I use minirocket_multivariate on the PenDigits dataset from the UEA multivariate archive, but there are NaN values in X_training_transform.
Also, the results on UEA are poor compared to the results of minirocket_dv on UCRArchive_2018. Can you give me some suggestions?
Code:
parameters = fit(X_training,num_features = 10_000)
X_training_transform = transform(X_training, parameters)
print('X_training_transform:',X_training_transform)
print('type(X_training_transform):',type(X_training_transform))
print("X_training_transform.shape:", X_training_transform.shape)
print("np.isnan(X_training_transform).any():", np.isnan(X_training_transform).any())
classifier = RidgeClassifierCV(alphas = np.logspace(-3, 3, 10), normalize = True)
classifier.fit(X_training_transform, Y_training)
X_test_transform = transform(X_test, parameters)
predictions = classifier.predict(X_test_transform)
Report:
last_X_training.shape: (7494, 2, 8)
last_X_test.shape: (3498, 2, 8)
last_Y_training.shape: (7494,)
last_Y_test.shape: (3498,)
X_training_transform: [[0. 0. 0. ... 0.625 0.875 0.375]
[0. 0. 0. ... 0.625 1. 0.125]
[0. 0. 0. ... 0.375 0.625 0.25 ]
...
[0. 0. 0. ... 0.375 0.875 0.125]
[0. 0. 0. ... 0.25 1. 0.125]
[0. 0. 0. ... 0.5 0.875 0.125]]
type(X_training_transform): <class 'numpy.ndarray'>
X_training_transform.shape: (7494, 9996)
np.isnan(X_training_transform).any(): True
Traceback (most recent call last):
File "cc-test.py", line 68, in
classifier.fit(X_training_transform, Y_training)
File "/newhome/chenc/miniforge3/envs/AIcocahing/lib/python3.6/site-packages/sklearn/linear_model/_ridge.py", line 1943, in fit
multi_output=True, y_numeric=False)
File "/newhome/chenc/miniforge3/envs/AIcocahing/lib/python3.6/site-packages/sklearn/base.py", line 433, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "/newhome/chenc/miniforge3/envs/AIcocahing/lib/python3.6/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/newhome/chenc/miniforge3/envs/AIcocahing/lib/python3.6/site-packages/sklearn/utils/validation.py", line 878, in check_X_y
estimator=estimator)
File "/newhome/chenc/miniforge3/envs/AIcocahing/lib/python3.6/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/newhome/chenc/miniforge3/envs/AIcocahing/lib/python3.6/site-packages/sklearn/utils/validation.py", line 721, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "/newhome/chenc/miniforge3/envs/AIcocahing/lib/python3.6/site-packages/sklearn/utils/validation.py", line 106, in _assert_all_finite
msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Some questions about the multivariate version

Hello, I have read the code for multivariate MiniRocket, and the way the channels are combined does not make sense to me.
Conv(x), where x is channel 0
Conv(y), where y is channel 1
When the channels are combined, this just becomes:
Conv(x+y)
Why not change np.sum to np.prod, giving:
Conv(x*y)
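
The identity this question relies on is the linearity of convolution: summing the per-channel convolutions gives the same result as convolving the summed channels, whereas a product of channels does not reduce in that way. The sketch below checks this numerically with `np.convolve`. The kernel is an illustrative length-9 kernel with weights -1 and 2, and the channels are random; none of this is taken from minirocket_multivariate.py.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=32)  # hypothetical channel 0
y = rng.normal(size=32)  # hypothetical channel 1

# Illustrative length-9 kernel with weights -1 and 2.
kernel = np.array([-1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0])

sum_of_convs = np.convolve(x, kernel, mode="valid") + np.convolve(y, kernel, mode="valid")
conv_of_sum = np.convolve(x + y, kernel, mode="valid")
conv_of_product = np.convolve(x * y, kernel, mode="valid")

print(np.allclose(sum_of_convs, conv_of_sum))      # True
print(np.allclose(sum_of_convs, conv_of_product))  # False
```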

Example of CSV file reading

Hello, I'm trying to figure out what minirocket expects as data on input. I keep on getting
TypeError: No matching definition for argument type(s) pyobject, array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C)

My data has following format:

timestamp,close
1619773130596,54559.47
1619773134938,54563.93
1619773139226,54554.23
1619773143564,54564.34

And I read it like this:

dataset = pd.read_csv(filename, usecols = [0, 1], header=0)
dataset = dataset.dropna()
dataset.columns = dataset.columns.to_series().apply(lambda x: x.strip())
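
The `pyobject` in the error message suggests that something other than a NumPy array (for example, a pandas DataFrame) reached fit(...). A minimal sketch of turning the `close` column into a 2d np.float32 array, using the four rows shown above and `np.loadtxt` in place of pandas, might look like this. Treating the whole column as a single series is an assumption for illustration only, and four values are far too few for real use.

```python
import io

import numpy as np

csv_text = """timestamp,close
1619773130596,54559.47
1619773134938,54563.93
1619773139226,54554.23
1619773143564,54564.34
"""

# Read only the "close" column (index 1), skipping the header row.
close = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=1, dtype=np.float32)

# Arrange as (num_examples, series_length): here, one series of four values.
X = close.reshape(1, -1)

print(X.shape, X.dtype)  # (1, 4) float32
```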

download 10,000+ Training Examples

hi, where can I download the MosquitoSound, InsectSound, and FruitFlies datasets?