
A machine learning library for detecting anomalies in signals.

Home Page: https://sintel.dev/Orion/

License: MIT License

Languages: Python 96.36%, Makefile 3.09%, Dockerfile 0.29%, Shell 0.26%
Topics: anomaly-detection, deep-learning, machine-learning, time-series, benchmarking, signals, unsupervised-learning, orion, data-science, generative-adversarial-network

Orion's Introduction

An open source project from the Data to AI Lab at MIT.


Orion

A machine learning library for unsupervised time series anomaly detection.

Important Links
  • 💻 Website: Check out the Sintel Website for more information about the project.
  • 📖 Documentation: Quickstarts, User and Development Guides, and API Reference.
  • Tutorials: Check out our notebooks.
  • :octocat: Repository: The link to the GitHub Repository of this library.
  • 📜 License: The repository is published under the MIT License.
  • Community: Join our Slack Workspace for announcements and discussions.

Overview

Orion is a machine learning library built for unsupervised time series anomaly detection. Given time series data, we provide a number of “verified” ML pipelines (a.k.a. Orion pipelines) that identify rare patterns and flag them for expert review.

The library makes use of a number of automated machine learning tools developed at the Data to AI Lab at MIT.

Read about using an Orion pipeline on the NYC taxi dataset in a blog series:

  • Part 1: Learn about unsupervised time series anomaly detection
  • Part 2: Learn how we use GANs to solve the problem
  • Part 3: How does one evaluate anomaly detection pipelines?

Notebooks: Discover Orion through Colab by launching our notebooks!

Quickstart

Install with pip

The easiest and recommended way to install Orion is using pip:

pip install orion-ml

This will pull and install the latest stable release from PyPI.

In the following example we show how to use one of the Orion Pipelines.

Fit an Orion pipeline

We will load some demo data for this example:

from orion.data import load_signal

train_data = load_signal('S-1-train')
train_data.head()

which should show a signal with timestamp and value.

    timestamp     value
0  1222819200 -0.366359
1  1222840800 -0.394108
2  1222862400  0.403625
3  1222884000 -0.362759
4  1222905600 -0.370746

In this example we use the aer pipeline and set some hyperparameters (in this case, the number of training epochs to 5).

from orion import Orion

hyperparameters = {
    'orion.primitives.aer.AER#1': {
        'epochs': 5,
        'verbose': True
    }
}

orion = Orion(
    pipeline='aer',
    hyperparameters=hyperparameters
)

orion.fit(train_data)

Detect anomalies using the fitted pipeline

Once it is fitted, we are ready to use it to detect anomalies in our incoming time series:

new_data = load_signal('S-1-new')
anomalies = orion.detect(new_data)

⚠️ Depending on your system and the exact versions you have installed, some WARNINGS may be printed. These can be safely ignored as they do not interfere with the proper behavior of the pipeline.

The output of the previous command will be a pandas.DataFrame containing a table of detected anomalies:

        start         end  severity
0  1402012800  1403870400  0.122539

Leaderboard

In every release, we run the Orion benchmark. We maintain an up-to-date leaderboard with the current scoring of the verified pipelines according to the benchmarking procedure.

We run the benchmark on 12 datasets with known ground truth and record the score of each pipeline on each dataset. To compute the leaderboard table, we showcase the number of wins each pipeline has over the ARIMA pipeline.

| Pipeline                  | Outperforms ARIMA |
|---------------------------|-------------------|
| AER                       | 11                |
| TadGAN                    | 7                 |
| LSTM Dynamic Thresholding | 8                 |
| LSTM Autoencoder          | 7                 |
| Dense Autoencoder         | 7                 |
| VAE                       | 6                 |
| LNN                       | 7                 |
| Matrix Profile            | 5                 |
| GANF                      | 5                 |
| Azure                     | 0                 |

You can find the scores of each pipeline on every signal recorded in the details Google Sheets document. The summarized results can also be browsed in the following summary Google Sheets document.

Resources

Additional resources that might be of interest:

Citation

If you use AER for your research, please consider citing the following paper:

Lawrence Wong, Dongyu Liu, Laure Berti-Equille, Sarah Alnegheimish, Kalyan Veeramachaneni. AER: Auto-Encoder with Regression for Time Series Anomaly Detection.

@inproceedings{wong2022aer,
  title={AER: Auto-Encoder with Regression for Time Series Anomaly Detection},
  author={Wong, Lawrence and Liu, Dongyu and Berti-Equille, Laure and Alnegheimish, Sarah and Veeramachaneni, Kalyan},
  booktitle={2022 IEEE International Conference on Big Data (IEEE BigData)},
  pages={1152-1161},
  doi={10.1109/BigData55660.2022.10020857},
  organization={IEEE},
  year={2022}
}

If you use TadGAN for your research, please consider citing the following paper:

Alexander Geiger, Dongyu Liu, Sarah Alnegheimish, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. TadGAN - Time Series Anomaly Detection Using Generative Adversarial Networks.

@inproceedings{geiger2020tadgan,
  title={TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks},
  author={Geiger, Alexander and Liu, Dongyu and Alnegheimish, Sarah and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
  booktitle={2020 IEEE International Conference on Big Data (IEEE BigData)},
  pages={33-43},
  doi={10.1109/BigData50022.2020.9378139},
  organization={IEEE},
  year={2020}
}

If you use Orion, which is part of the Sintel ecosystem, for your research, please consider citing the following paper:

Sarah Alnegheimish, Dongyu Liu, Carles Sala, Laure Berti-Equille, Kalyan Veeramachaneni. Sintel: A Machine Learning Framework to Extract Insights from Signals.

@inproceedings{alnegheimish2022sintel,
  title={Sintel: A Machine Learning Framework to Extract Insights from Signals},
  author={Alnegheimish, Sarah and Liu, Dongyu and Sala, Carles and Berti-Equille, Laure and Veeramachaneni, Kalyan},  
  booktitle={Proceedings of the 2022 International Conference on Management of Data},
  pages={1855–1865},
  numpages={11},
  publisher={Association for Computing Machinery},
  doi={10.1145/3514221.3517910},
  series={SIGMOD '22},
  year={2022}
}

Orion's People

Contributors

alexandergeiger, ban2aru, csala, dailab-bot, dyuliu, hector-hedb12, hramir, kronerte, kveerama, lcwong0928, manuelalvarezc, micahjsmith, pvk-developer, sarahmish


Orion's Issues

Intermediate outputs of pipeline

Description

We want to store intermediate outputs of the pipeline. The user might want to store more information about the datarun besides just the found anomalies.

Approach

We add another collection, IO, to the DB, which has the fields datarun and io. datarun is a reference to the datarun that created the intermediate output, and io is a field that stores a dictionary.
A user could then use the pipeline JSON to specify what to store, following the specifications introduced to MLBlocks here.

Once that MLBlocks feature is available, we can make use of the intermediate outputs and store them as a dictionary, using primitive.variable_name as the key.

The anomalies are stored in the events collection just like before; therefore, they should be returned normally and not as intermediate outputs.
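
A minimal sketch of what the proposed collection could look like, assuming the MongoEngine-based schema style used elsewhere in Orion (class and field names are illustrative, not final):

from mongoengine import Document, fields


class IO(Document):
    # reference to the datarun that produced the intermediate output
    datarun = fields.ReferenceField('Datarun')
    # dictionary of intermediate outputs, keyed by ``primitive.variable_name``
    io = fields.DictField()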

Add a TimeSeriesAnomalyDetector

There will be a single class in the estimators module. It will contain all the logic required to fit, score, tune and obtain predictions from a pipeline, and it will expose the following methods (a rough skeleton follows the list):

  • __init__: The class constructor. It will receive the pipeline in the MLBlocks format, along with the cross-validation configuration and persistence options.

  • tune: Will use BTB and the cross-validation and scorer from the constructor to find the best hyperparameters for the given template.

  • fit: Will fit a pipeline using the hyperparameters found by the tune method or the ones given as an argument to the init method.

  • predict: Will use the pipeline to make predictions. It will raise an exception if the pipeline has not been fitted.
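
A rough skeleton of how such a class could look, assuming MLBlocks pipelines; the constructor arguments and tuning details are placeholders, not a final design:

from mlblocks import MLPipeline


class TimeSeriesAnomalyDetector:

    def __init__(self, pipeline, cv_splits=5, scorer=None, hyperparameters=None):
        # pipeline is an MLBlocks specification; cv_splits and scorer stand in
        # for the cross-validation configuration mentioned above
        self._pipeline_spec = pipeline
        self._cv_splits = cv_splits
        self._scorer = scorer
        self._hyperparameters = hyperparameters
        self._pipeline = None

    def tune(self, data):
        """Use BTB with the cross-validation and scorer from the constructor
        to find the best hyperparameters for the given template."""
        raise NotImplementedError

    def fit(self, data):
        """Fit a pipeline using the tuned or given hyperparameters."""
        self._pipeline = MLPipeline(self._pipeline_spec)
        if self._hyperparameters:
            self._pipeline.set_hyperparameters(self._hyperparameters)
        self._pipeline.fit(data)

    def predict(self, data):
        """Use the fitted pipeline to make predictions."""
        if self._pipeline is None:
            raise ValueError('The pipeline has not been fitted yet.')
        return self._pipeline.predict(data)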

User Interface

A user interface class should be created (named Orion?) that provides the user-level functionality of the project (a hypothetical usage sketch follows the list).

  1. (Previous step): The user prepares primitives and a pipeline that uses them (JSON).
  2. The user creates the Orion class passing:
    • arguments needed to load the data
    • the pipeline to be used (JSON path?)
    • the DB connection details (or DB object)
  3. (optional) The user calls the tune method to tune the pipeline on the data.
  4. The user calls the evaluate method to compute a score for the pipeline (default or tuned).
  5. The user calls the save method to store the pipeline in the DB with all the required details.
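
A hypothetical usage flow matching the steps above; argument names and values are placeholders, not the final API:

# 2. create the Orion instance with the data, pipeline and DB details
orion = Orion(
    data='path/to/signal.csv',
    pipeline='path/to/pipeline.json',
    database='mongodb://localhost:27017/orion',
)

orion.tune()              # 3. (optional) tune the pipeline on the data
score = orion.evaluate()  # 4. compute a score for the pipeline
orion.save()              # 5. store the pipeline in the DB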

Add the S3 location explicitly

At the moment, the command orion add dataset {name} expects an optional positional argument location which points at a local CSV.
If the location is not given or does not point at a local CSV, the name is assumed to be the name of one of the datasets hosted in Amazon S3.

Let's make this explicit, with the following format:

orion add dataset {name} s3://{bucket_name}/path/to/the.csv

For example, the S-1 dataset could be added as:

orion add dataset S-1 s3://d3-ai-orion/S-1.csv

Optionally, a boolean flag --demo could be added so that one is allowed to execute orion add dataset S-1 --demo and skip the S3 bucket specification.

Add F0.5 score to metrics

Description

In the original paper an F0.5 score is used for evaluating the result. We might want to add that metric as well.
sklearn.metrics.fbeta_score with beta=0.5 should achieve that.
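
For instance (labels here are illustrative point-wise anomaly flags, not Orion output):

from sklearn.metrics import fbeta_score

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

# beta=0.5 weighs precision more heavily than recall
score = fbeta_score(y_true, y_pred, beta=0.5)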

Threshold for unusually low errors

In the original implementation another threshold is used to detect unusually low errors.
The intuition is that unusually low errors are anomalous as well, because the prediction is more accurate than average. This is done by 'flipping' the errors 'around the mean' and applying the existing threshold function again.
However, this improves the score only slightly, so we could add a boolean parameter to the find_anomalies primitive that controls whether to use the low-error threshold, making it optional.
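
A minimal sketch of the flipping idea, assuming errors is a numpy array of prediction errors and find_threshold stands in for the existing thresholding logic (both names are assumptions):

import numpy as np


def low_error_anomalies(errors, find_threshold):
    # mirror the errors around their mean so unusually low errors become
    # unusually high ones
    flipped = 2 * np.mean(errors) - errors
    # reuse the existing high-error thresholding on the flipped errors
    return find_threshold(flipped)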

Rename Score to Severity

Currently, the analyze method returns a column called score, which indicates the severity of the anomaly.
This name is also in the Database Schema, currently as a field of the Events class, and possibly elsewhere after #66.

The problem with this name is that it can be confused with a goodness-of-fit score, such as accuracy or f1, so I want to propose changing it to severity everywhere.

Larger prediction window

In the original implementation a prediction window of size 10 is used. However, only the first value of that prediction sequence is used for the error calculation.
I am not sure whether this was done intentionally (instead of using a prediction window of size 1).
The predicted values of the two methods differ a bit, but the overall score is not affected much, likely because the predictions, and thus the errors, change throughout the signal, resulting in a similar (but in general slightly higher) error curve, to which the threshold is applied. So I am not sure if we need to implement that in our pipeline at the moment.

Rolling predictions

Description

We want to be able to train and predict with our model several times in a moving manner: train a model and predict anomalies for the first n timestamps, move the window by some steps, and repeat the procedure. Therefore we also need to remove any anomalies that we find before we train on that data in the next step.

Suggestion

We will introduce a new primitive that allows specifying intervals that should be dropped from the data. Then we can use the found anomalies and exclude them from the signal while we iterate over it.

In Orion we could just specify the window_size that we want to use and modify the analyze method to iterate over the signal:

import numpy as np
import pandas as pd


def analyze(pipeline, X):

    pipeline = _load_pipeline(pipeline)

    found_intervals = []
    found_events = []

    start = 0
    training_size = 2000
    testing_size = 2000

    # slide a train/test window pair over the signal
    while start < len(X) - training_size - testing_size:
        train_window = X[start:start + training_size]

        if start + testing_size < len(X) - training_size - testing_size:
            test_window = X[start + training_size - 250:start + training_size + testing_size]
        else:
            test_window = X[start + training_size - 250:]

        # previously found anomalous intervals are excluded from training
        pipeline.fit(train_window, train_ind=True, intervals=found_intervals)
        events = pipeline.predict(test_window, train_ind=False, intervals=found_intervals)
        for event in events:
            found_events.append(event)
            found_intervals.append((event[0], event[1]))

        start = start + testing_size

    if not found_events:
        # no anomalies found: return a single zero-severity placeholder event
        found_events.append([X.iloc[0]['timestamp'], X.iloc[0]['timestamp'], 0])

    found_events = pd.DataFrame(np.vstack(found_events), columns=['start', 'end', 'score'])
    found_events['start'] = found_events['start'].astype(int)
    found_events['end'] = found_events['end'].astype(int)

    return found_events

Rename evaluation to benchmark and metrics to evaluation

Currently, the orion package is organized as follows.

orion
├── evaluation.py 
└── metrics.py

With the new metrics subpackage, we would like to turn metrics.py into an evaluation subpackage and rename evaluation.py to benchmark.py.

orion
├── benchmark.py 
└── evaluation -> used to be metrics.py
    ├── __init__.py 
    ├── common.py 
    ├── contextual.py 
    ├── point.py 
    └── utils.py

Adjust `epoch` meaning in Cyclegan primitive

The epoch variable in the cyclegan primitive refers to the number of iterations over a given batch_size. However, the typical definition of an epoch is one iteration over the entire dataset.

The current behavior will therefore vary greatly between datasets and different batch_size values.

Ideas about anomaly scores

The find_anomaly primitive outputs a list of events, each associated with an anomaly score. These scores are float values >= 0 with no upper bound.

If we had a way to use integers from 0 to 10 to indicate the severity level, that would make more sense.

Let's collect some possible solutions in this post.
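
One possible mapping, for discussion only: rescale the unbounded scores of a run to integer severity levels between 0 and 10 (the function name and scaling rule are hypothetical):

import numpy as np


def to_severity_levels(scores):
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros(len(scores), dtype=int)

    # min-max scale to [0, 1], then round to integers between 0 and 10
    scaled = (scores - scores.min()) / span
    return np.round(scaled * 10).astype(int)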

Training & Test Set

Description

We should figure out which training/test split of the data makes the most sense.
We currently use all the available data for training if the user does not specify a test_size, i.e. the model is trained with anomalies.
We might want to remove anomalies from the training set so that the model can learn the normal behavior. In the original paper the training set also does not contain anomalies.

Implement the Orion Class

This issue is to track the development of the new Orion class.

The Orion class is responsible for handling the MLBlocks Pipelines that provide the central anomaly detection functionality in Orion.

Overall, the Orion class:

  • Provides simple user-facing abstractions
    • fit/detect
    • save/load
    • evaluate
  • Hides away the interaction with other systems
    • MLBlocks Pipelines
    • Pipeline Selection and Tuning (future?)

This should be the class public interface:

class Orion:

    def __init__(self,
        pipeline: Union[str, dict, MLPipeline] = DEFAULT_PIPELINE,
        hyperparameters: dict = None):
        pass
    
    def fit(self, data: DataFrame):
        """Fit the pipeline to the given data.
        
        Args:
            data (DataFrame):
                Input data, passed as a ``pandas.DataFrame`` containing
                exactly two columns: timestamp and value.
        """
        pass
        
    def detect(self, data: DataFrame,
               visualization: bool = True) -> DataFrame:
        """Detect anomalies in the given data..
        
        If ``visualization=True``, also return the visualization
        outputs from the MLPipeline object.
        
        Args:
            data (DataFrame):
                Input data, passed as a ``pandas.DataFrame`` containing
                exactly two columns: timestamp and value.
            visualization (bool):
                If ``True``, also capture the ``visualization`` named
                output from the ``MLPipeline`` and return it as a second
                output.
        
        Returns:
            DataFrame or tuple:
                If visualization is ``False``, it returns the events
                DataFrame. If visualization is ``True``, it returns a
                tuple containing the events DataFrame followed by the
                visualization outputs dict.
        """
        pass
        
    def fit_detect(self, data: DataFrame,
                   visualization: bool = True) -> DataFrame:
        """Fit the pipeline to the data and detect anomalies.
        
        This method is functionally equivalent to calling `fit(data)`
        and later on `detect(data)` but with the difference that
        here the `MLPipeline` is called only once, using its `fit`
        method, and the output is directly captured without having
        to execute the whole pipeline again during the `predict` phase.
        
        If ``visualization=True``, also return the visualization
        outputs from the MLPipeline object.
        
        Args:
            data (DataFrame):
                Input data, passed as a ``pandas.DataFrame`` containing
                exactly two columns: timestamp and value.
            visualization (bool):
                If ``True``, also capture the ``visualization`` named
                output from the ``MLPipeline`` and return it as a second
                output.
        
        Returns:
            DataFrame or tuple:
                If visualization is ``False``, it returns the events
                DataFrame. If visualization is ``True``, it returns a
                tuple containing the events DataFrame followed by the
                visualization outputs dict.
        """
        pass
    
    def save(self, path):
        """Save this object using pickle.
        
        Args:
            path (str):
                Path to the file where the serialization of
                this object will be stored.
        """
        pass
    
    @classmethod
    def load(cls, path) -> 'Orion':
        """Load an Orion instance from a pickle file.
        
        Args:
            path (str):
                Path to the file where the instance has been
                previously serialized.
        """
        pass
    
    def evaluate(self, data: DataFrame, truth: DataFrame,
                 metrics: List[str] = DEFAULT_METRICS) -> Series:
        """Evaluate the performance against a ground truth.
        
        Args:
            data (DataFrame):
                Input data, passed as a ``pandas.DataFrame`` containing
                exactly two columns: timestamp and value.
            truth (DataFrame):
                Ground truth passed as a ``pandas.DataFrame`` containing
                two columns: start and stop.
            metrics (list):
                List of metrics to use, passed as a list of strings.
                If not given, it defaults to all the Orion metrics.
        
        Returns:
            Series:
                ``pandas.Series`` containing one element for each
                metric applied, with the metric name as index.
        """
        pass

Implement new functional interface

Once the Orion Class in #79 is implemented, add a functional interface that allows using Orion in as few steps as possible and hides away some of the irrelevant steps.

The API should be implemented as three functions:

  • fit_pipeline: Learn an Orion pipeline and save it.
  • detect_anomalies: Analyze a signal to detect anomalies. Optionally learn a pipeline.
  • evaluate_pipeline: Evaluate the performance of a pipeline against a list of known anomalies.
def fit_pipeline(
    data: Union[str, DataFrame] = None,
    pipeline: Union[str, Pipeline, dict] = None,
    hyperparameters: Union[str, dict] = None,
    save_path: str = None) -> Orion:
    """Fit an Orion pipeline to the data.

    The pipeline can be passed as:
        * An ``str`` with a path to a JSON file.
        * An ``str`` with the name of a registered Orion pipeline.
        * An ``MLPipeline`` instance.
        * A ``dict`` with an ``MLPipeline`` specification.

    If no pipeline is passed, the default Orion pipeline is used.

    Args:
        data (str or DataFrame):
            Data to which the pipeline should be fitted.
            It can be passed as a path to a CSV file or as a DataFrame.
        pipeline (str or Pipeline or dict):
            Pipeline to use. It can be passed as:
                * An ``str`` with a path to a JSON file.
                * An ``str`` with the name of a registered pipeline.
                * An ``MLPipeline`` instance.
                * A ``dict`` with an ``MLPipeline`` specification.
        hyperparameters (str or dict):
            Hyperparameters to set to the pipeline. It can be passed as a
            hyperparameters ``dict`` in the ``mlblocks`` format or as a
            path to the corresponding JSON file. Defaults to
            ``None``.
        save_path (str):
            Path to the file where the fitted pipeline will be stored
            using ``pickle``. If not given, the Orion pipeline is
            returned. Defaults to ``None``.
    """
    pass
def detect_anomalies(
    data: Union[str, DataFrame] = None,
    pipeline: Union[str, Pipeline, dict] = None,
    hyperparameters: Union[str, dict] = None,
    train_data: Union[str, DataFrame] = None) -> DataFrame:
    """Detect anomalies on timeseries data.

    The anomalies are detected using an Orion pipeline which can
    be passed as:
        * An ``str`` with a path to a JSON file.
        * An ``str`` with the path to a pickle file.
        * An ``str`` with the name of a registered Orion pipeline.
        * An ``MLPipeline`` instance.
        * A ``dict`` with an ``MLPipeline`` specification.

    If no pipeline is passed, the default Orion pipeline is used.

    Optionally, separate training data can be passed to fit
    the pipeline before using it to detect anomalies.

    Args:
        data (str or DataFrame):
            Data to analyze searching for anomalies.
            It can be passed as a path to a CSV file or as a DataFrame.
        pipeline (str or Pipeline or dict):
            Pipeline to use. It can be passed as:
                * An ``str`` with a path to a JSON file.
                * An ``str`` with the name of a registered pipeline.
                * An ``str`` with the path to a pickle file.
                * An ``MLPipeline`` instance.
                * A ``dict`` with an ``MLPipeline`` specification.
        hyperparameters (str or dict):
            Hyperparameters to set to the pipeline. It can be passed as a
            hyperparameters ``dict`` in the ``mlblocks`` format or as a
            path to the corresponding JSON file. Defaults to
            ``None``.
        train_data (str or DataFrame):
            Data to which the pipeline should be fitted.
            It can be passed as a path to a CSV file or as a DataFrame.
            If not given, the pipeline is used without fitting it first.
    """
    pass
def evaluate_pipeline(
    data: Union[str, DataFrame] = None,
    truth: Union[str, DataFrame] = None,
    pipeline: Union[str, dict, MLPipeline] = None,
    hyperparameters: Union[str, dict] = None,
    metrics: List[Union[callable, str]] = None,
    train_data: Union[str, DataFrame] = None) -> DataFrame:
    """Evaluate the performance of a pipeline.

    The pipeline is evaluated by executing it on a signal
    for which anomalies are known and then applying one or
    more metrics to it to compute scores.
    
    The pipeline can be passed as:
        * An ``str`` with a path to a JSON file.
        * An ``str`` with the path to a pickle file.
        * An ``str`` with the name of a registered Orion pipeline.
        * An ``MLPipeline`` instance.
        * A ``dict`` with an ``MLPipeline`` specification.

    If the pipeline is not fitted, separate training data can be
    passed to fit the pipeline before using it to detect anomalies.

    Args:
        data (str or DataFrame):
            Data to analyze searching for anomalies.
            It can be passed as a path to a CSV file or as a DataFrame.
        truth (str or DataFrame):
            Table of known anomalies to use as the ground truth for
            scoring. It can be passed as a path to a CSV file or as a
            DataFrame.
        pipeline (str or Pipeline or dict):
            Pipeline to use. It can be passed as:
                * An ``str`` with a path to a JSON file.
                * An ``str`` with the name of a registered pipeline.
                * An ``str`` with the path to a pickle file.
                * An ``MLPipeline`` instance.
                * A ``dict`` with an ``MLPipeline`` specification.
        hyperparameters (str or dict):
            Hyperparameters to set to the pipeline. It can be passed as
            a hyperparameters ``dict`` in the ``mlblocks`` format or as
            a path to the corresponding JSON file. Defaults to ``None``.
        metrics (list[str]):
            List of metrics to use. If not passed, all the Orion metrics
            are applied.
        train_data (str or DataFrame):
            Data to which the pipeline should be fitted.
            It can be passed as a path to a CSV file or as a DataFrame.
            If not given, the pipeline is used without fitting it first.
    """
    pass

Add pipeline scoring command

Add a command to evaluate one or more pipelines using a list (or all) of demo signals.

Interface should be like this:

orion evaluate [-m <metric_1> [-m <metric_2>]..] [-s <signal_1> [-s <signal_2>]...] <path/to/pipeline_1.json> [<path/to/pipeline_2.json>]...
  • Base Command: orion evaluate
  • -m - optional, multiple: metric(s) to use. If not given use all of them.
  • -s - optional, multiple: signal(s) to use. If not given, use all of them.
  • pipeline(s): one or more paths to pipelines.

The output should be an ASCII table with this exact format (valid md format):

|  pipeline  |  metric_1 |  metric_2 | ... |
|------------|-----------|-----------|-----|
| pipeline_1 | score_1_1 | score_1_2 | ... |
| pipeline_2 | score_2_1 | score_2_2 | ... |
|     ...    |    ...    |    ...    | ... |

Where score_i_j is the average of the metric_j score obtained by pipeline_i across all the signals used for the evaluation.

The ability to identify local anomalies

https://github.com/D3-AI/Orion/blob/7ed2956f7f1f7ee540c6e1137e0c02a6b2d0ea58/orion/pipelines/lstm_dynamic_threshold.json#L9

In this lstm pipeline, the function "find_anomalies" takes all errors into consideration to find a global threshold. This limits the model's capability to identify local anomalies.

We can refer to the method introduced in the article "Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding", where h is specified as a hyperparameter: the number of historical error values used to evaluate current errors.
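
A rough sketch of the idea, assuming errors is a 1-D numpy array of smoothed errors and h is the number of historical values used to judge the current one (the threshold rule shown is a simplification, not the paper's exact method):

import numpy as np


def local_anomalies(errors, h, z=4):
    anomalous = []
    for i in range(h, len(errors)):
        history = errors[i - h:i]
        # flag the current error if it deviates strongly from recent history
        if errors[i] > history.mean() + z * history.std():
            anomalous.append(i)

    return anomalous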

orion.analysis.analyze crashes when no event is found

  • Orion version: 0.1.0
  • Python version: 3.6
  • Operating System: all

Description

The problem happens when running the CSV analysis example notebook. At line 38 of orion.analysis.analyze, if events is empty, pd.DataFrame raises an error.

File encoding/decoding issues about `README.md` and `HISTORY.md`

  • Operating System: Ubuntu 16

Description

Errors are encountered when running make install-develop due to file encoding/decoding issues.

Error message

pip install -e .[dev]
Obtaining file:///home/dongyu/apps/orion-test
    ERROR: Command errored out with exit status 1:
     command: /home/dongyu/.conda/envs/orion-test/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/dongyu/apps/orion-test/setup.py'"'"'; __file__='"'"'/home/dongyu/apps/orion-test/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info
         cwd: /home/dongyu/apps/orion-test/
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/dongyu/apps/orion-test/setup.py", line 8, in <module>
        readme = readme_file.read()
      File "/home/dongyu/.conda/envs/orion-test/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 115: ordinal not in range(128)
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Makefile:82: recipe for target 'install-develop' failed
make: *** [install-develop] Error 1

Solution

# setup.py

try:
    with open('README.md', encoding='utf-8') as readme_file:
        readme = readme_file.read()
except IOError:
    readme = ''

try:
    with open('HISTORY.md', encoding='utf-8') as history_file:
        history = history_file.read()
except IOError:
    history = ''

Changing database schema

@kveerama and I discussed some changes of the DB schema:

  • add name and description fields to Datarun collection
  • add an Experiment collection, which consists of multiple Dataruns. (We might want to restrict it such that one Experiment can only consist of multiple Dataruns with the same Dataset)

The idea is that the first screen a user would see is the Experiments overview. If an Experiment is clicked, the different Dataruns belonging to the Experiment are shown. Thus, Experiments are a way of structuring multiple Dataruns. A Datarun consists of a Dataset and a Pipeline. When a Datarun is clicked, the Dataset itself and all Events and Comments belonging to the Datarun are displayed.

If anyone has additional thoughts on that, please feel free to comment.

Overlapping Thresholds

We should include overlapping thresholds in the find_anomalies primitive. This means that the windows for which the thresholds are calculated overlap, i.e. each point has multiple thresholds that could declare it an anomaly. This was done in the original implementation by NASA. I have already done this in my local pipeline and I will open a PR to include it in the standard MLPrimitives version, too.

Additional input dimensions

In the original implementation more than one input dimension is used.
We are currently only using the telemetry value itself, but there are some additional command dimensions in the raw data.
However, the scores don't seem to be affected much by those additional dimensions.
We might just want to consider them when actually comparing the results of our pipeline to NASA's results.

Two bugs when saving signalrun if there is no event detected

Error message:

INFO:orion.runner:Processing pipeline arima predictions on signal 11125
ERROR:orion.db.schema:Error storing signalrun 5ef409e3c2be97adf0ee798c events
Traceback (most recent call last):
  File "/home/dongyu/apps/orion/orion/db/schema.py", line 242, in end
    for start_time, stop_time, severity in events:
TypeError: 'NoneType' object is not iterable
ERROR:orion.runner:Datarun 5ef409e3c2be97adf0ee798b crashed
Traceback (most recent call last):
  File "/home/dongyu/apps/orion/orion/db/schema.py", line 242, in end
    for start_time, stop_time, severity in events:
TypeError: 'NoneType' object is not iterable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dongyu/apps/orion/orion/runner.py", line 99, in start_datarun
    start_signalrun(orex, datarun, signal)
  File "/home/dongyu/apps/orion/orion/runner.py", line 73, in start_signalrun
    signalrun.end(status, events)
  File "/home/dongyu/apps/orion/orion/db/schema.py", line 253, in end
    status = self.STATUS_ERROR
AttributeError: 'Signalrun' object has no attribute 'STATUS_ERROR'
INFO:orion.runner:Datarun 5ef40a5bc2be97adf0ee798d started
INFO:orion.runner:Signalrun 5ef40a5bc2be97adf0ee798e started
INFO:orion.runner:Running pipeline lstm on signal 11125

BUG1:
TypeError: 'NoneType' object is not iterable
When there is no event detected, the variable events is expected to be an empty list rather than None.

BUG2:
'Signalrun' object has no attribute 'STATUS_ERROR'
Go to base.py, lines 205-208:

STATUS_PENDING = 'PENDING'
STATUS_RUNNING = 'RUNNING'
STATUS_SUCCESS = 'SUCCESS'
STATUS_ERRORED = 'ERRORED'

However, in schema.py, line 253:

status = self.STATUS_ERROR

The constant name is used incorrectly.
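
A sketch of possible fixes for both bugs, based on the traceback above (the surrounding schema.py code is paraphrased, not exact):

# BUG1: normalize the events result before iterating, so a signalrun with no
# detected events behaves like an empty list instead of None
events = None  # what gets passed when nothing is detected
for start_time, stop_time, severity in (events or []):
    print(start_time, stop_time, severity)

# BUG2: the base class defines STATUS_ERRORED (not STATUS_ERROR), so the
# error branch should read `status = self.STATUS_ERRORED`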

Need validations during data preprocessing stage

When running Orion on certain datasets, we may encounter the following errors.

The start_time or stop_time of a signal can be a strange negative value.

This error results from invalid timestamp values in the raw data file. I think we have to do format validation when we load signals from certain locations.

This bug has a very harmful impact: if the start_time is a negative value and, say, the interval we use to aggregate data is 6 hours, we will create a huge number of "unnecessary" intervals starting from the wrong start_time.
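
A hedged sketch of the kind of validation suggested, assuming the usual two-column signal format (timestamp, value); the function name and checks are illustrative:

import pandas as pd


def validate_signal(df: pd.DataFrame) -> pd.DataFrame:
    # reject signals with negative or unsorted timestamps before aggregation
    if (df['timestamp'] < 0).any():
        raise ValueError('Signal contains negative timestamps.')

    if not df['timestamp'].is_monotonic_increasing:
        raise ValueError('Signal timestamps are not sorted in increasing order.')

    return df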

Make holdout optional in the evaluation

The current pipeline evaluation system does not support choosing whether to split the data into train and test partitions or to fit and predict on the same data.

This could be added as an optional argument for both the Python interface and the CLI.

Add better exception messages

There are two kinds of exceptions when a dataset is not found:

  • AttributeError: 'NoneType' object has no attribute 'data_location' in the load_dataset function.
  • HTTP 404 from read_csv (pandas) in the load_nasa_signal function.

Fix bottle neck of `score_anomaly` in Cyclegan primitive

In the current Cyclegan primitive implementation, the following piece of code causes a huge bottleneck when computing the anomaly score.

for c in critic:
   critic_extended = critic_extended + np.repeat(c, y_hat.shape[1]).tolist()

@dyuliu discovered that this is caused by concatenating with +; the proposed fix is to use a simple list extension with extend.
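
The proposed fix, sketched on the same variables as the snippet above (critic and y_hat come from the surrounding primitive code):

critic_extended = []
for c in critic:
    # extend mutates the list in place and avoids re-copying it on every step
    critic_extended.extend(np.repeat(c, y_hat.shape[1]).tolist())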

Discuss a good way to store signals and events in different granularities

We have to use the signal processing pipeline to produce signals at different aggregation levels.

One of the requirements from end-users is to look at the signal at different granularities, such as 6 minutes, 30 minutes, 1 hour, 6 hours, etc.

(1) How do we store the signals in the database to allow efficient data exploration? We have to make a tradeoff between time efficiency and space efficiency.
- Solution 1: store only the most fine-grained signals so that we can use this data to infer the ones at a coarser level.
- Solution 2: store signals at every granularity ranging from 6 minutes to 6 hours.

(2) One signal can be processed at different aggregation levels, and then anomalous events for every aggregation level could be generated in different shapes. We need to think of a good way to organize these events that are related to the same signal but at different granularities.

Issues with the original code/paper

There are a few things in the original NASA project/paper that are a bit confusing, so I will list them here for your information and future reference:

  • The labels for the anomalous regions of the channels in the labeled_anomalies.csv file seem to be off by some number (presumably 260). In the original implementation this offset is cancelled out again, but the anomaly regions that are reported in the file are not correct per se and we have to take care of that in our implementation. This should already be reflected in the anomalies data set we have in our AWS bucket.
  • The F0.5 scores reported in the paper seem to be calculated incorrectly (their reported Precision and Recall scores should give a higher F0.5 score)
  • If we use the original code and retrain the models, we do not reach the same scores, but we are somewhere close (reported true F0.5 score is 0.859, while we reach 0.802. If the pre-trained models are taken, the score is quite similar to the paper).

A problem of the current DB schema of "Experiment"

Description

The DB schema of "Experiment" does not allow retrieving detailed information about the running results, because we don't know which dataruns are related to the experiment.

Solution

Current schema:

class Experiment(Document, MongoUtils):
    project = fields.StringField()
    pipeline = fields.ReferenceField(Pipeline)
    dataset = fields.ReferenceField(Dataset)
    created_by = fields.StringField()

Proposed schema:

class Experiment(Document, MongoUtils):
    project = fields.StringField(required=True)
    pipeline = fields.ReferenceField(Pipeline)
    dataset = fields.ReferenceField(Dataset)
    created_by = fields.StringField()
    name = fields.StringField(required=True)
    events = fields.IntField(required=True)
    dataruns = fields.ListField(fields.ReferenceField(Datarun))
    start_time = fields.DateTimeField(required=True)
    end_time = fields.DateTimeField()
    status = fields.StringField()

The most important attribute is dataruns. If we don't store this information, we cannot retrieve any detailed information about this experiment. We can only know which signals were run in this experiment, but nothing about the running results.

Create dummy primitives and pipeline

Create a set of simple dummy primitives that work with the expected input and generate the expected output and compile a demo pipeline that uses them.

CLI commands fail on empty databases

Some commands raise exceptions in an empty database:

  • $ orion list dataruns: KeyError: 'software_versions'
  • $ orion list events: AttributeError: 'DataFrame' object has no attribute 'event_id'
  • $ orion list comments: KeyError: 'insert_time'

dynamic scalability of TadGAN primitive based on `window_size`

In the current implementation of cyclegan, changing the window_size might not work because the units of the layers are defined independently.

For example the encoder layer is defined as:

"layers_encoder": {
                "type": "list",
                "default": [
                    {
                        "class": "keras.layers.Bidirectional",
                        "parameters": {
                            "layer": {
                                "class": "keras.layers.LSTM",
                                "parameters": {
                                    "units": 100,
                                    "return_sequences": true
                                }
                            }
                        }
                    },
                    {
                        "class": "keras.layers.Flatten",
                        "parameters": {}
                    },
                    {
                        "class": "keras.layers.Dense",
                        "parameters": {
                            "units": 20
                        }
                    },
                    {
                        "class": "keras.layers.Reshape",
                        "parameters": {
                            "target_shape": "encoder_reshape_shape"
                        }
                    }
                ]
            }

In this case, the units of the LSTM layer should equal the input size (window_size).

Unroll timeseries sequences based on `step_size`

  • Orion version: 0.1.1
  • Python version: 3.6

Description

At the moment, timeseries predictions are unrolled as a "shrunken" version of the timeseries when step_size > 1.

A clear example of this is demonstrated here

Assume the original timeseries length is 10100, and window_size=100.

For step_size = 1
The input matrix is (10000, 100, 1), so y_hat is also (10000, 100, 1); then we take the median of the 100 values at every time step -> (10000, 1). The final output of primitive score_anomaly is (10000, 1), indicating the error at every time step.

For step_size = 5
The input matrix will be (2000, 100, 1) -> y_hat (2000, 100, 1) -> (2000, 1) -> the final output of primitive score_anomaly is (2000, 1). The index array is also of shape (2000, 1). But in fact, we expect the output to be (10000, 1), and the recorded index array to have the same shape (10000, 1).

Proposed Solution

Alter the score_anomaly function to take step_size as an argument and unroll the timeseries to the original dimension.
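
A possible sketch of the unrolling, assuming per-window errors indexed every step_size timesteps; one simple approach is to repeat each windowed value so the output lines up with the original per-timestep index (function and variable names are illustrative):

import numpy as np


def unroll_errors(errors, step_size):
    # repeat each windowed error ``step_size`` times so the result has one
    # value per original timestep, e.g. (2000,) -> (10000,) for step_size=5
    return np.repeat(np.asarray(errors), step_size)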

Integrate with Codecov

  • Integrate with Codecov to make sure all changes from PRs improve the code coverage of tests.
  • Update the readme with the new badge.

Extend the Database Model

The MongoDB class should include the methods to interact with the database following the predefined schema:

  • datasets

    • dataset id
    • name
    • signal set
    • satellite id
    • start time
    • stop time
    • date created
    • created by
  • events

    • event id
    • datarun id
    • dataset id
    • model location
    • metrics location
    • tag
    • start time
    • stop time
    • MIT team comments
    • SES team comments
  • dataruns

    • datarun id
    • dataset id
    • mlblocks version
    • mlprimitives version
    • btb version
    • copulas version
    • settings used
    • model settings
    • start time
    • end time
    • budget type
    • budget amount
    • model location
    • metrics location
    • created by

Shape matching API

As mentioned in issue #72, we would like to implement a feedback integration module in Orion which uses subsequence shape matching. Basically any suitable method could be used; therefore, we should decide how these methods can be implemented and connected through an API in order to make them interchangeable.

Since we already rely on MLPrimitives for the unsupervised anomaly detection part, it would make sense to implement the shape matching methods as primitives as well.

Some things we should discuss:

  • Inputs and outputs to these primitives, so that we can easily integrate them into the feedback component of Orion
  • Where to store these primitives (Orion or MLPrimitives)

Integrate user feedback into Orion

One significant part of Orion is the user interaction, where users can annotate signals through MTV.
We can use these annotations to improve future anomaly detections.

A simple proposal of how this workflow could look:

  • For each signal in the signalset of the datarun:
    • run pipeline on signal and find anomalies
    • get all known events that are related to the signal and have an annotation tag from the database
    • For each known event:
      • get the aggregated signal (in intermediate outputs) from the datarun where the known event was found
      • get the shape of the sequence that was marked as anomalous in the known event
      • compare this shape to the aggregated signal of the current datarun using a specified method (e.g. DTW) and check if some subsequence is significantly closer than others
      • if there is a similar sequence, add an event with source 'shape matching' and a corresponding annotation tag that is similar to the tag of the original event
      • if there is any anomaly that was found in the current datarun, which overlaps with the known event, remove it from the list of found anomalies
    • add all remaining found anomalies as an event with source 'orion'

@sarahmish came up with a first skeleton of how we could implement that in Orion:

from orion.explorer import OrionExplorer


class OrionFeedback:
    """This class manages the annotated events for a specified signal from
    MTV and incorporates them back into Orion.

    Should this be inherited from the OrionExplorer?
    """

    def execute_feedback(self, datarun, signal_id):
        """This is the main method for feedback.

        It loads the specified signal, fetches its known events, applies
        shape matching to the known events, then returns a resolved list
        of labels for the signal.

        Attributes:
            - datarun: the datarun with the annotations
            - signal_id: the specific signal in the datarun for which we are executing feedback
        """
        signal = self.get_signal(signal_id)
        known_events = self.get_known_events(datarun, signal_id)

        for event in known_events:
            matched_events = self.shape_matching(signal, event)
            if self.overlap(matched_events, known_events):
                # priority: user > orion
                #           user > shape_matching
                #           shape_matching > orion
                #
                # if same priority:
                #           user ? user
                #           shape_matching.match_score ? shape_matching.match_score
                #           keep the higher score
                #
                # or have overlap favour anomalies over normal labels generally.
                pass

        return matched_events + known_events

    def shape_matching(self, signal, segment, method="dtw"):
        """This method returns the similarity score between signal and segment.

        Attributes:
            - signal: the signal data
            - segment: the segment we want to match
            - method: the algorithm used
        """
        pass

    def overlap(self, shapes, window):
        """This method returns shapes after removing overlapping components,
        keeping the higher scored ones.

        Attributes:
            - shapes: a list of dicts, each with an ``id`` (start index) and a ``cost``
            - window: the length of each matched shape
        """
        # sort by matching cost so the best matches are kept first
        shapes = sorted(shapes, key=lambda x: x["cost"])
        no_overlap = []

        for first_shape in shapes:
            first_range = range(first_shape["id"], first_shape["id"] + window)

            flag = True
            for second_shape in no_overlap:
                second_range = range(second_shape["id"], second_shape["id"] + window)

                xs = set(first_range)
                if len(xs.intersection(second_range)) > 0:
                    flag = False

            if flag:
                no_overlap.append(first_shape)

        return no_overlap

    # helper functions
    def get_signal(self, signal):
        """Get the signal from mongodb."""
        # exists in OrionExplorer
        pass

    def get_pipeline(self, pipeline):
        """Get the pipeline from mongodb."""
        # exists in OrionExplorer
        pass

    def get_known_events(self, datarun, signal):
        """Get registered events for a particular signal in a given datarun
        from mongodb.
        """
        pass

Some points that should be discussed:

  • Should this class be a subclass of the OrionExplorer, since it requires much of the same functionality?
  • How do we handle cases where multiple users annotated a sequence with different labels?
  • Should we use raw or aggregated signals for the shape matching?
  • What methods can be used for the shape matching besides DTW? Based on user annotations, can we use supervised (and maybe online) Machine Learning methods for subsequence classification?

I would appreciate having an offline installation

  • Orion version: 0.1.0
  • Python version: 3.6
  • Operating System: centos 7

Description

We usually work on workstations with highly constrained network access.
It would be greatly appreciated to be able to install Orion without internet access.
Failing that, please provide a list of packages/repositories to install/download manually before attempting the Orion installation.

Currently trying:
make install
output:
$ make install
fatal: Not a git repository (or any of the parent directories): .git
rm -fr build/
rm -fr dist/
rm -fr .eggs/
find . -name '*.egg-info' -exec rm -fr {} +
find . -name '*.egg' -exec rm -f {} +
find . -name '*.pyc' -exec rm -f {} +
find . -name '*.pyo' -exec rm -f {} +
find . -name '*~' -exec rm -f {} +
find . -name '__pycache__' -exec rm -fr {} +
pip install . -r requirements_dev.txt
Processing /home/dcalvo/Documents/MIT/Orion-master
Obtaining mlprimitives from git+https://github.com/HDI-Project/MLPrimitives.git@9b910e8c683bd69ac55aa4ef759f696fd91a7134#egg=mlprimitives (from -r requirements_dev.txt (line 1))
Cloning https://github.com/HDI-Project/MLPrimitives.git (to revision 9b910e8c683bd69ac55aa4ef759f696fd91a7134) to ./src/mlprimitives
fatal: unable to access 'https://github.com/HDI-Project/MLPrimitives.git/': Could not resolve host: github.com; Unknown error
Command "git clone -q https://github.com/HDI-Project/MLPrimitives.git /home/dcalvo/Documents/MIT/Orion-master/src/mlprimitives" failed with error code 128 in None
make: *** [install] Error 1

Thanks!

error while running make install

  • Orion version: Latest commit ac44a13
  • Python version: 3.6
  • Operating System: CENTOS 7

Description

I did a git pull and was following the install instructions.
make install fails because it cannot find pytest-runner.
The pytest-runner documentation (https://pypi.org/project/pytest-runner/) explicitly states that setup_requires and tests_require should be removed and the tests run differently.

What I Did

make install
rm -fr build/
rm -fr dist/
rm -fr .eggs/
find . -name '*.egg-info' -exec rm -fr {} +
find . -name '*.egg' -exec rm -f {} +
find . -name '*.pyc' -exec rm -f {} +
find . -name '*.pyo' -exec rm -f {} +
find . -name '*~' -exec rm -f {} +
find . -name '__pycache__' -exec rm -fr {} +
pip install .
Processing /home/dcalvo/ORION
    ERROR: Command errored out with exit status 1:
     command: /home/anaconda/anaconda3/envs/sedate36_dev/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-skzlmbak/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-skzlmbak/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
         cwd: /tmp/pip-req-build-skzlmbak/
    Complete output (25 lines):
    Download error on https://pypi.org/simple/pytest-runner/: [Errno -2] Name or service not known -- Some packages may not be found!
    Couldn't find index page for 'pytest-runner' (maybe misspelled?)
    Download error on https://pypi.org/simple/: [Errno -2] Name or service not known -- Some packages may not be found!
    No local packages or working download links found for pytest-runner>=2.11.1
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-skzlmbak/setup.py", line 112, in <module>
        zip_safe=False,
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/setuptools/__init__.py", line 144, in setup
        _install_setup_requires(attrs)
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/setuptools/__init__.py", line 139, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/setuptools/dist.py", line 717, in fetch_build_eggs
        replace_conflicting=True,
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/pkg_resources/__init__.py", line 782, in resolve
        replace_conflicting=replace_conflicting
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1065, in best_match
        return self.obtain(req, installer)
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1077, in obtain
        return installer(requirement)
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/setuptools/dist.py", line 784, in fetch_build_egg
        return cmd.easy_install(req)
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 673, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pytest-runner>=2.11.1')
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
make: *** [install] Error 1

I simply commented out the contents of setup_requires and tests_require.

Scoring function for intervals of size one

The known anomalies in a time series could be of length one. In that case, the current segment-based scoring function will not consider the existence of an overlap.

def _partition(expected, observed, start=None, end=None):
    edges = set()

    if start is not None:
        edges.add(start)

    if end is not None:
        edges.add(end)

    for edge in expected + observed:
        edges.update(edge)

    partitions = list()
    edges = sorted(edges)
    last = edges[0]
    for edge in edges[1:]:
        partitions.append((last, edge))
        last = edge

    expected_parts = list()
    observed_parts = list()
    weights = list()
    for part in partitions:
        weights.append(part[1] - part[0] + 1)
        expected_parts.append(_any_overlap(part, expected))
        observed_parts.append(_any_overlap(part, observed))

    return expected_parts, observed_parts, weights

If we look at the _partition function, we notice that edges is a set and will only include unique edges.
But an example of a perfectly detected anomaly is as follows:

known_anomalies = 
| start | end |
|-------|-----|
| 3     | 3   |

observed anomalies = 
| start | end |
|-------|-----|
| 3     | 3   |

Now edges = {3}, and therefore, the result of this function will be an empty partition.
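
Reproducing the issue with the _partition function above:

expected = [(3, 3)]
observed = [(3, 3)]

parts, obs, weights = _partition(expected, observed)
print(parts, obs, weights)  # [] [] [] -- the perfect match never gets scored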

Add anomalous area around an anomaly

We should mark a certain area around found anomalies as anomalous as well. This will improve our pruning process afterwards, as it should not reclassify as many anomalies as normal again. I have already done this in my local implementation and will open a PR to MLPrimitives to include it in the find_anomalies primitive.
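
A rough sketch of the padding idea, assuming anomalies are given as (start, end) index tuples; the function name and the pad parameter are illustrative:

def pad_anomalies(anomalies, pad, signal_length):
    """Mark ``pad`` extra points on each side of every anomaly as anomalous."""
    return [
        (max(0, start - pad), min(signal_length - 1, end + pad))
        for start, end in anomalies
    ]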
