baraline / convst

Implementation of the Random Dilated Shapelet Transform algorithm along with interpretability tools. ReadTheDocs documentation is not up to date with the current version for now.

Home Page: https://convst.readthedocs.io/en/latest/

License: BSD 2-Clause "Simplified" License

Python 99.57% Makefile 0.43%
time-series-classification shapelet-transform shapelets ucr-archive algorithm series univariate paper convolutions python

convst's Introduction

This package is moving to the aeon-toolkit.

Starting from v0.3.0, this package will no longer be updated, although bugfixes will still be included if issues are raised. You can already find RDST in the aeon package at https://github.com/aeon-toolkit/ . Further improvements to speed up RDST are planned; these improvements will only be implemented in aeon.

ALL FUNCTIONALITIES OF THIS PACKAGE OUTSIDE OF THE INTERPRETER ARE NOW PORTED INTO AEON FROM V0.6.0, PLEASE REFER TO THE AEON IMPLEMENTATION WHEN DOING EXPERIMENTS.

AN EXAMPLE NOTEBOOK ON HOW TO CORRECTLY INTERPRET SHAPELETS FROM RDST IS PLANNED (see aeon-toolkit/aeon#973)

If these functionalities are what you need, I highly recommend that you use aeon, as I spent more time on the aeon implementation and tests than on convst.

Readme

Welcome to the convst repository. It contains the implementation of the Random Dilated Shapelet Transform (RDST) along with other works in the same area. This work was supported by the following organisations:

Installation

The recommended way to install the latest stable version is to use pip with pip install convst. To install the package from sources, you can download the latest version on GitHub and run python setup.py install. This should install the package and automatically look for the dependencies using pip.

We recommend doing this in a new virtual environment using anaconda to avoid any conflict with an existing installation. If you wish to install dependencies individually, you can see dependencies in the setup.py file.

An optional dependency that can help speed up numba, which is used in our implementation, is the Intel vector math library (SVML). When using conda it can be installed by running conda install -c numba icc_rt. I didn't test the behavior with AMD processors, but I suspect it won't work.

Tutorial

We give here a minimal example to run the RDST algorithm on any dataset of the UCR archive using the aeon API to get datasets:

from convst.classifiers import R_DST_Ridge
from convst.utils.dataset_utils import load_UCR_UEA_dataset_split

X_train, X_test, y_train, y_test, _ = load_UCR_UEA_dataset_split('GunPoint')

# The first run may be slow because numba compiles functions on the first call.
# Run a small dataset like GunPoint if this is the first time you call RDST on your system.
# You can set n_shapelets to 1 to make this process faster. The n_jobs parameter can
# also be changed to increase speed once the numba compilations are done.

rdst = R_DST_Ridge(n_shapelets=10_000, n_jobs=1).fit(X_train, y_train)
print("Accuracy Score for RDST : {}".format(rdst.score(X_test, y_test)))

If you want a more powerful model, you can use R_DST_Ensemble as follows (note that additional Numba compilation might be needed here):

from convst.classifiers import R_DST_Ensemble

rdst_e = R_DST_Ensemble(
  n_shapelets_per_estimator=10_000,
  n_jobs=1
).fit(X_train, y_train)
print("Accuracy Score for RDST : {}".format(rdst_e.score(X_test, y_test)))

You can obtain faster results by using more jobs, and even faster ones, at the expense of some accuracy, with the prime_dilations option:

rdst_e = R_DST_Ensemble(
  n_shapelets_per_estimator=10_000,
  prime_dilations=True,
  n_jobs=-1
).fit(X_train, y_train)

print("Accuracy Score for RDST : {}".format(rdst_e.score(X_test, y_test)))

You can also visualize a shapelet with the visualization tool, which produces a figure such as the following:

Example of shapelet visualization

To know more about all the interpretability tools, check the documentation on readthedocs.

Supported inputs

RDST supports the following types of time series:

  • Univariate and same length
  • Univariate and variable length
  • Multivariate and same length
  • Multivariate and variable length

We use the standard scikit-learn interface and expect as input a 3D numpy array of shape (n_samples, n_features, n_timestamps). For variable length input, we expect a (python) list of numpy arrays, or a numpy array with object dtype.
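
As a rough illustration of the two input formats described above, the sketch below builds toy data with NumPy only. The per-series shape (n_features, n_timestamps_i) used for variable-length input is an assumption made for this example; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same-length input: one 3D array of shape (n_samples, n_features, n_timestamps).
X_fixed = rng.normal(size=(4, 2, 50))  # 4 series, 2 channels, 50 time points

# Variable-length input: a Python list of arrays. Assumption for this sketch:
# each array has shape (n_features, n_timestamps_i), only the length varies.
lengths = [30, 45, 50, 38]
X_var_list = [rng.normal(size=(2, n)) for n in lengths]

# Same data as a NumPy array with object dtype. Filling it element by element
# prevents NumPy from trying to stack the ragged arrays.
X_var_obj = np.empty(len(X_var_list), dtype=object)
for i, series in enumerate(X_var_list):
    X_var_obj[i] = series

print(X_fixed.shape)
print([x.shape for x in X_var_list])
print(X_var_obj.dtype, X_var_obj.shape)
```

A univariate dataset follows the same layout with n_features set to 1.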

Reproducing the paper results

Multiple scripts are available under the PaperScripts folder. It contains the exact same scripts used to generate our results, notably the test_models.py file, used to generate the csv results available in the Results folder of the archive.

Contributing, Citing and Contact

If you are experiencing bugs in the RDST implementation, or would like to contribute in any way, please create an issue or pull request in this repository. For other questions, or to get in touch with me, you can email me at [email protected]

If you use our algorithm or publication in any work, please cite the following paper (ArXiv version https://arxiv.org/abs/2109.13514):

@InProceedings{10.1007/978-3-031-09037-0_53,
author="Guillaume, Antoine
and Vrain, Christel
and Elloumi, Wael",
title="Random Dilated Shapelet Transform: A New Approach for Time Series Shapelets",
booktitle="Pattern Recognition and Artificial Intelligence",
year="2022",
publisher="Springer International Publishing",
address="Cham",
pages="653--664",
abstract="Shapelet-based algorithms are widely used for time series classification because of their ease of interpretation, but they are currently outperformed by recent state-of-the-art approaches. We present a new formulation of time series shapelets including the notion of dilation, and we introduce a new shapelet feature to enhance their discriminative power for classification. Experiments performed on 112 datasets show that our method improves on the state-of-the-art shapelet algorithm, and achieves comparable accuracy to recent state-of-the-art approaches, without sacrificing neither scalability, nor interpretability.",
isbn="978-3-031-09037-0"
}

To cite the RDST Ensemble method, you can cite the PhD thesis where it is presented as (soon to be available, citation format may change):

@phdthesis{Guillaume2023,
  author="Guillaume, Antoine", 
  title="Time series classification with Shapelets: Application to predictive maintenance on event logs",
  school="University of Orléans",
  year="2023",
  url="https://www.theses.fr/s265104"
}

TODO for release 1.0:

  • Finish Numpy docs in all python files
  • Update documentation and examples
  • Enhance interface for interpretability tools
  • Add the Generalised version of RDST
  • Continue unit tests and code coverage/quality

Citations

Here are the code-related citations that were not made in the paper

[1]: The Scikit-learn development team, "Scikit-learn: Machine Learning in Python", Journal of Machine Learning Research 2011

[2]: The Numpy development team, "Array programming with NumPy", Nature 2020

convst's Issues

Make public code used to draw diagrams

  • Create a new util file, containing the private code used to generate critical difference and pairwise plots.
  • Add examples to the documentation and links

RDST parallelism KeyError

During some, but not all, runs (e.g. on the FordA / FordB datasets), the RDST Ensemble classifier fails with the following error dump:

joblib.externals.loky.process_executor._RemoteTraceback: 
Traceback (most recent call last):
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 469, in save
    data_name = overloads[key]
KeyError: ((array(float64, 3d, C), Tuple(array(float64, 2d, C), array(int64, 1d, C), array(int64, 1d, C), array(float64, 1d, C), array(bool, 1d, C)), type(CPUDispatcher(<function manhattan at 0x7fce883fe0d0>)), bool), ('x86_64-unknown-linux-gnu', 'cascadelake', '+64bit,+adx,+aes,-amx-bf16,-amx-int8,-amx-tile,+avx,+avx2,-avx512bf16,-avx512bitalg,+avx512bw,+avx512cd,+avx512dq,-avx512er,+avx512f,-avx512ifma,-avx512pf,-avx512vbmi,-avx512vbmi2,+avx512vl,+avx512vnni,-avx512vp2intersect,-avx512vpopcntdq,+bmi,+bmi2,-cldemote,+clflushopt,+clwb,-clzero,+cmov,+cx16,+cx8,-enqcmd,+f16c,+fma,-fma4,+fsgsbase,+fxsr,-gfni,+invpcid,-lwp,+lzcnt,+mmx,+movbe,-movdir64b,-movdiri,-mwaitx,+pclmul,-pconfig,+pku,+popcnt,-prefetchwt1,+prfchw,-ptwrite,-rdpid,+rdrnd,+rdseed,-rtm,+sahf,-serialize,-sgx,-sha,-shstk,+sse,+sse2,+sse3,+sse4.1,+sse4.2,-sse4a,+ssse3,-tbm,-tsxldtrk,-vaes,-vpclmulqdq,-waitpkg,-wbnoinvd,-xop,+xsave,+xsavec,+xsaveopt,+xsaves'), ('00e465fe82fb9c04ee9ece12d3d459a0d4fe0a0d451df090bccee8dc666d02b2', 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker
    r = call_item()
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 288, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/utils/fixes.py", line 117, in __call__
    return self.function(*args, **kwargs)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/convst-0.2.1-py3.8.egg/convst/classifiers/rdst_ensemble.py", line 56, in _parallel_fit
    return model.fit(X, y)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/pipeline.py", line 378, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/pipeline.py", line 336, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/pipeline.py", line 870, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/base.py", line 870, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/convst-0.2.1-py3.8.egg/convst/transformers/rdst.py", line 270, in transform
    X_new = self.transformer(
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/dispatcher.py", line 487, in _compile_for_args
    raise e
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/dispatcher.py", line 420, in _compile_for_args
    return_val = self.compile(tuple(argtypes))
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/dispatcher.py", line 972, in compile
    self._cache.save_overload(sig, cres)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 652, in save_overload
    self._save_overload(sig, data)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 662, in _save_overload
    self._cache_file.save(key, data)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 478, in save
    self._save_index(overloads)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 522, in _save_index
    data = self._dump(data)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 550, in _dump
    return dumps(obj)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/serialize.py", line 57, in dumps
    p.dump(obj)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/cloudpickle/cloudpickle_fast.py", line 568, in dump
    return Pickler.dump(self, obj)
  File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/types/functions.py", line 486, in __getnewargs__
    raise ReferenceError("underlying object has vanished")
ReferenceError: underlying object has vanished

Helper function for numba compilations

Design a helper function to run all numba compilations with very small data to avoid issues with first-time uses.

  • Helper function must compile all possible function signatures (32/64)
  • Ideally, it should be run during the installation process (is that possible?), not during imports, to avoid repeating it. If that is not possible, provide it as part of the API
  • Update the documentation and README to mention this function
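
One possible shape for such a helper is sketched below. It is not part of convst: the name warm_up, its parameters, and the reading of "32/64" as float32/float64 inputs are all assumptions. It only relies on the estimator exposing fit and score, as in the README tutorial, and uses a trivial stand-in classifier so the sketch runs without convst installed.

```python
import numpy as np

def warm_up(make_estimator, dtypes=(np.float32, np.float64),
            n_samples=4, n_features=1, n_timestamps=16, seed=0):
    """Fit and score a fresh estimator on tiny random data, once per dtype.

    With a numba-backed estimator, each call would trigger the compilation
    of the signatures matching that input dtype (assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    y = np.arange(n_samples) % 2  # two alternating classes
    done = []
    for dtype in dtypes:
        X = rng.normal(size=(n_samples, n_features, n_timestamps)).astype(dtype)
        estimator = make_estimator()
        estimator.fit(X, y)
        estimator.score(X, y)
        done.append(np.dtype(dtype).name)
    return done

class ConstantClassifier:
    """Hypothetical stand-in estimator used only to run the sketch."""

    def fit(self, X, y):
        self.label_ = y[0]
        return self

    def score(self, X, y):
        return float(np.mean(np.asarray(y) == self.label_))

print(warm_up(ConstantClassifier))
```

With convst installed, the factory could be something like lambda: R_DST_Ridge(n_shapelets=1, n_jobs=1), reusing the class from the tutorial.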

unexpected keyword argument 'n_jobs_rdst'

When running test_on_arff_resamples.py, the following error occurred. How to solve it?

PaperScripts/test_on_arff_resamples.py:None (PaperScripts/test_on_arff_resamples.py)
test_on_arff_resamples.py:54: in <module>
    pipeline_RDST_rdg = model_class(n_jobs=3, n_jobs_rdst=100//3)
E   TypeError: __init__() got an unexpected keyword argument 'n_jobs_rdst'

[BUG] v0.1.5.2 numba function get_subsequence return nan values

In some rare edge cases (e.g., on ACSF1), the _get_subsequence function in the RDST transform may return NaN values after normalization, because the standard deviation is computed as NaN. This went unnoticed in the experiments, as the StandardScaler deals with the NaN values before the Ridge classifier.

  • Fix the implementation of _get_subsequence to avoid a NaN standard deviation.
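
A guarded normalization could look like the sketch below. This is not the convst implementation: the function name, its signature, the eps value, and the choice to fall back to a zero vector are assumptions for illustration.

```python
import numpy as np

def get_subsequence(x, start, length, dilation, normalize=True, eps=1e-8):
    """Extract a dilated subsequence of x and optionally z-normalize it.

    The caller must ensure start + (length - 1) * dilation < len(x).
    """
    idx = start + dilation * np.arange(length)
    sub = np.asarray(x, dtype=np.float64)[idx]
    if not normalize:
        return sub
    mean = sub.mean()
    std = sub.std()
    # Guard: a (near-)constant subsequence has std close to 0, and non-finite
    # inputs give a non-finite std. Dividing would then produce NaN or inf,
    # so fall back to a zero vector instead.
    if not np.isfinite(std) or std < eps:
        return np.zeros(length)
    return (sub - mean) / std

flat = get_subsequence(np.ones(20), start=0, length=5, dilation=3)
ramp = get_subsequence(np.arange(20.0), start=1, length=4, dilation=2)
print(flat)
print(ramp)
```

The constant input shows the guarded branch, while the ramp input goes through the regular z-normalization.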

Alpha similarity mask for multivariate time series

In the multivariate transforms, the alpha similarity mask of shape (2,n_dilations,n_samples,n_features,n_timestamps) is shared independently of the feature subset used. A shapelet using features 1,2,3 and another features 2,3,4 will then share 2/3 of the mask information, which may be restrictive for selecting some interesting patterns.

  • Change the alpha similarity mask logic for multivariate time series
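
The sharing effect can be reproduced on a toy mask with the shape given above. In this sketch, True meaning "still available for sampling", the use of index 0 on the first axis, and the 0-based feature subsets [0, 1, 2] and [1, 2, 3] (standing in for 1,2,3 and 2,3,4) are assumptions, not the package's actual logic.

```python
import numpy as np

n_dilations, n_samples, n_features, n_timestamps = 3, 5, 4, 20
# Toy mask with the shape described above; True = position still available.
mask = np.ones((2, n_dilations, n_samples, n_features, n_timestamps), dtype=bool)

features_a = [0, 1, 2]
features_b = [1, 2, 3]

# Shapelet A is sampled from sample 0, dilation index 1, timestamps 5..9.
# The positions are masked for every feature A uses.
mask[0, 1, 0, features_a, 5:10] = False

# Shapelet B looks at the same region through its own feature subset.
view_b = mask[0, 1, 0, features_b, 5:10]
blocked_fraction = 1.0 - view_b.mean()
print(blocked_fraction)  # 2 of the 3 features of B are already masked
```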

[BUG] Alpha similarity with multiple input lengths.

The current formula for alpha similarity was not adapted to multiple length shapelets. This does not cause any exception, but can harm the shapelet sampling process.

  • Change alpha similarity in all transformers to $\max(1, (1-\alpha)\min(L))$
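
To make the proposed expression concrete, here is a direct evaluation for a set of shapelet lengths. The function name is hypothetical, and the sketch leaves the result unrounded since the issue does not say how it should be converted to an integer.

```python
def alpha_similarity_bound(alpha, lengths):
    """Evaluate max(1, (1 - alpha) * min(L)) for a collection of lengths L."""
    return max(1, (1 - alpha) * min(lengths))

lengths = [7, 9, 11]
print(alpha_similarity_bound(0.5, lengths))  # (1 - 0.5) * 7 = 3.5
print(alpha_similarity_bound(0.9, lengths))  # about 0.7, so the floor of 1 applies
```

Using min(L) ties the value to the shortest shapelet length, and the outer max keeps it at 1 or more when alpha is close to 1.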

Additional bug with multiple lengths:

  • Debug tuple creation for length and dilation parameters in transformers

Dependencies update for Python 3.11

When installing convst for Python 3.11, errors occur because some dependency versions do not match, which is possible because the dependencies have no upper version bound.

  • Update the lower bounds of the dependency versions
  • Add upper bounds
  • Add Python 3.11 to the test pipeline
  • Add a deprecation check to the test pipeline (is it possible to test against both the lower and upper versions specified?)
  • Move test, doc and graphic dependencies to optional dependencies

Adapt API to sktime 0.11+

Notes to self:

  • Some changes in the sktime API with 0.11+, along with deprecation warnings to fix
  • Check that the required version is still up to date

Potential impacts:

  • Dataset loading utilities
  • Some utility functions for input checks
  • Some model APIs used in the experiments for the paper

SystemError: _PyEval_EvalFrameDefault returned a result with an error set

Working on multivariate time series with shape (6202x2x33), I get this error when running the interpreter model:
SystemError: _PyEval_EvalFrameDefault returned a result with an error set
Please let me know if you need more information.
With univariate time series, I don't see the error.

[BUG] Multivariate channel initialisation on v0.2.6

In some cases, shapelet initialization for multivariate time series returns mostly channel 0. The problem disappears when setting the Numba parallel option to False.

  • Find the origin of the bug in the multivariate initialization with parallel option to True.

[BUG] BadZipFile and ValueError on Wafer Dataset

Describe the bug
There is a data loading error on the 'Wafer' dataset, while 'GunPoint' loads correctly.

To Reproduce
You can reproduce the error by running the following file.
https://1drv.ms/f/s!AuU5Lmr0utymk4IKdA1WNoxzkD0tNg

Expected behavior
Accuracy should be 1 on the Wafer dataset.

Code example
Same as above

Environment

  • OS: Google Colab
  • Version of the convst package: Latest (0.3.0)

Improve docs

Add the video presentation made at ICPRAI to the Readme page and readthedocs as it is a good introduction to Shapelets and the components introduced by RDST.

  • Compile all video slides into one
  • Upload the video (is it possible to embed it directly in the README, or should it link to a video hosting service?)
  • Add dedicated section to readme and docs

Normalization for Ridge pipelines

The performance on ACSF1 is very low in a cross-validation setting for RDST + Ridge; the behaviour changed after the deprecation of the normalize argument of the Ridge classifier.
TODO:
Check whether or not the "with_mean" argument of StandardScaler should be set to False. Changing it to True seems to solve the issue.
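
For reference, the kind of pipeline under discussion can be sketched with scikit-learn alone. The helper name, the alphas grid, and the random stand-in features are assumptions for this example; it only shows where with_mean is set, not the pipeline convst actually builds.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_ridge_pipeline(with_mean=True):
    """Scale the features, then classify with a cross-validated Ridge."""
    return make_pipeline(
        StandardScaler(with_mean=with_mean),
        RidgeClassifierCV(alphas=np.logspace(-4, 4, 10)),
    )

# Stand-in for shapelet features: columns with large offsets and mixed scales.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
features = rng.normal(size=(40, 6)) * [1, 10, 100, 1, 10, 100] + 1000.0
features[:, 0] += y  # make one column informative

for with_mean in (True, False):
    model = make_ridge_pipeline(with_mean).fit(features, y)
    print(with_mean, model.score(features, y))
```

With with_mean=False the scaler only divides by the standard deviation, so large feature offsets are left for the Ridge intercept and regularization to handle.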

More options for shapelet sampling

In some use cases, it might be interesting to only sample shapelets from a subset of the input dataset. For example, in an imbalanced binary classification context, if class 0 represents 99% of the samples, the majority of shapelets will be sampled from class 0 with the current random sampling scheme. It might be more interesting to sample shapelets only from class 1, or to downsample class 0.

While it is simple to add this as a step before doing an RDST transform, it would require reimplementing or hacking through the RDST classifiers.

  • Replace the current n_samples parameter of RDST to allow different sampling options, such as:
      • Only sample shapelets from a specified subset of the data (boolean mask on samples and/or timestamps)
      • String options such as "balanced"
      • Current option with float to indicate a percentage for up/downsampling
  • Update API and docs of the rdst transformer and linked classifiers. Clarify which parameters (e.g. class_weight) impact the classifier and which impact the shapelet sampling.
  • Add unit tests for each option
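
A minimal sketch of how two of these options could select the candidate samples is given below. The function name, its signature, and the exact meaning given to "balanced" (downsampling every class to the size of the rarest one) are assumptions; nothing here exists in convst.

```python
import numpy as np

def candidate_sample_indices(y, sampling=None, seed=0):
    """Return the indices of the samples shapelets may be drawn from.

    sampling=None        -> every sample is a candidate
    sampling="balanced"  -> downsample each class to the rarest class size
    sampling=<bool mask> -> keep only the samples flagged True
    """
    y = np.asarray(y)
    all_idx = np.arange(y.shape[0])
    if sampling is None:
        return all_idx
    if isinstance(sampling, str):
        if sampling != "balanced":
            raise ValueError(f"unknown sampling option: {sampling!r}")
        rng = np.random.default_rng(seed)
        classes, counts = np.unique(y, return_counts=True)
        n_keep = counts.min()
        picked = [rng.choice(all_idx[y == c], size=n_keep, replace=False)
                  for c in classes]
        return np.sort(np.concatenate(picked))
    mask = np.asarray(sampling, dtype=bool)
    if mask.shape != y.shape:
        raise ValueError("the boolean mask must have one entry per sample")
    return all_idx[mask]

# Imbalanced toy labels: 99 samples of class 0 and one sample of class 1.
y = np.array([0] * 99 + [1])
print(candidate_sample_indices(y, "balanced"))
print(candidate_sample_indices(y, y == 1))
```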

Update results folder with latest results

The results folder is not up to date with the latest results: fetch the CSV files and replace the old ones. The CD figure is already updated; check whether other figures need updating.

Prime dilation slower on some cases

In some cases, using prime_dilations=True is slower than prime_dilations=False for RDST Ensemble. This can happen, for example, on the Rock dataset.

  • Investigate the issue (it only happens for Ensemble)
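
As background for the investigation, the sketch below shows how restricting dilations to prime values shrinks the set of candidates. The bound used for the maximum dilation and the choice to keep dilation 1 are assumptions of this sketch, not the sampling code of the package.

```python
def max_dilation(n_timestamps, length):
    """Largest dilation for which a shapelet of this length still fits."""
    return max(1, (n_timestamps - 1) // (length - 1))

def primes_up_to(n):
    """Sieve of Eratosthenes returning all primes <= n."""
    if n < 2:
        return []
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            for multiple in range(p * p, n + 1, p):
                sieve[multiple] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def candidate_dilations(n_timestamps, length, prime_only):
    d_max = max_dilation(n_timestamps, length)
    if prime_only:
        return [1] + primes_up_to(d_max)  # keep 1 so undilated shapelets remain
    return list(range(1, d_max + 1))

# Series of 150 time points, shapelets of length 11.
print(candidate_dilations(150, 11, prime_only=False))
print(candidate_dilations(150, 11, prime_only=True))
```

On short series the two sets are almost the same size, which leaves little room for a speed-up from the prime restriction itself.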

Extract distances and transformed shapelets from a built RDST model

How do I extract the distance vector or the transformed shapelets for each of my time series?
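
To clarify what is being asked for, the sketch below computes a distance vector between one univariate series and one dilated shapelet, and derives three features from it (minimum, position of the minimum, and number of positions under a threshold), following the features described for RDST. It is a plain NumPy illustration, not the convst or aeon API: the function names are hypothetical, normalization is left out, and the Manhattan distance is an assumption.

```python
import numpy as np

def distance_vector(x, shapelet, dilation):
    """Manhattan distance between the shapelet and every dilated window of x."""
    length = shapelet.shape[0]
    n_positions = x.shape[0] - (length - 1) * dilation
    offsets = dilation * np.arange(length)
    return np.array([np.abs(x[i + offsets] - shapelet).sum()
                     for i in range(n_positions)])

def shapelet_features(x, shapelet, dilation, threshold):
    """Return (min distance, argmin, count of distances below threshold)."""
    d = distance_vector(x, shapelet, dilation)
    return float(d.min()), int(d.argmin()), int((d < threshold).sum())

# Toy series, with a shapelet copied from position 10 using dilation 2.
x = np.arange(30.0) ** 2
shapelet = x[10 + 2 * np.arange(4)]

print(distance_vector(x, shapelet, dilation=2)[:12])
print(shapelet_features(x, shapelet, dilation=2, threshold=1.0))
```

Because the shapelet was copied from the series, its best match is at position 10 with a distance of 0.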

Shapelet and TS extraction

Hello,

I'm trying to use the RDST implementation for a TSC task, and I'm interested in the interpretability of the method. I would like to confirm the correct way to extract the shapelets of the model and the time series that generated them.

Thanks in advance,
