
A machine learning library for detecting anomalies in signals.

Home Page: https://sintel.dev/Orion/

License: MIT License

Languages: Python 96.36%, Makefile 3.09%, Dockerfile 0.29%, Shell 0.26%
Topics: anomaly-detection, deep-learning, machine-learning, time-series, benchmarking, signals, unsupervised-learning, orion, data-science, generative-adversarial-network

Orion's Introduction

An open source project from the Data to AI Lab at MIT.


Orion

A machine learning library for unsupervised time series anomaly detection.

Important Links
  • 💻 Website: Check out the Sintel Website for more information about the project.
  • 📖 Documentation: Quickstarts, User and Development Guides, and API Reference.
  • Tutorials: Check out our notebooks.
  • :octocat: Repository: The link to the GitHub Repository of this library.
  • 📜 License: The repository is published under the MIT License.
  • Community: Join our Slack Workspace for announcements and discussions.

Overview

Orion is a machine learning library built for unsupervised time series anomaly detection. Given time series data, we provide a number of “verified” ML pipelines (a.k.a. Orion pipelines) that identify rare patterns and flag them for expert review.

The library makes use of a number of automated machine learning tools developed at the Data to AI Lab at MIT.

Read about using an Orion pipeline on the NYC taxi dataset in a blog series:

  • Part 1: Learn about unsupervised time series anomaly detection
  • Part 2: Learn how we use GANs to solve the problem
  • Part 3: How does one evaluate anomaly detection pipelines?

Notebooks: Discover Orion through Colab by launching our notebooks!

Quickstart

Install with pip

The easiest and recommended way to install Orion is using pip:

pip install orion-ml

This will pull and install the latest stable release from PyPI.

In the following example we show how to use one of the Orion Pipelines.

Fit an Orion pipeline

We will load some demo data for this example:

from orion.data import load_signal

train_data = load_signal('S-1-train')
train_data.head()

which should show a signal with timestamp and value.

    timestamp     value
0  1222819200 -0.366359
1  1222840800 -0.394108
2  1222862400  0.403625
3  1222884000 -0.362759
4  1222905600 -0.370746

In this example we use the aer pipeline and set some hyperparameters (in this case, the number of training epochs to 5).

from orion import Orion

hyperparameters = {
    'orion.primitives.aer.AER#1': {
        'epochs': 5,
        'verbose': True
    }
}

orion = Orion(
    pipeline='aer',
    hyperparameters=hyperparameters
)

orion.fit(train_data)

Detect anomalies using the fitted pipeline

Once it is fitted, we are ready to use it to detect anomalies in our incoming time series:

new_data = load_signal('S-1-new')
anomalies = orion.detect(new_data)

⚠️ Depending on your system and the exact versions you have installed, some WARNINGS may be printed. These can be safely ignored as they do not interfere with the proper behavior of the pipeline.

The output of the previous command will be a pandas.DataFrame containing a table of detected anomalies:

        start         end  severity
0  1402012800  1403870400  0.122539

Leaderboard

In every release, we run the Orion benchmark. We maintain an up-to-date leaderboard with the current scoring of the verified pipelines according to the benchmarking procedure.

We run the benchmark on 12 datasets with known ground truth and record the score of each pipeline on each dataset. To compute the leaderboard table, we showcase the number of wins each pipeline has over the ARIMA pipeline.

| Pipeline                  | Outperforms ARIMA |
|---------------------------|-------------------|
| AER                       | 11                |
| TadGAN                    | 7                 |
| LSTM Dynamic Thresholding | 8                 |
| LSTM Autoencoder          | 7                 |
| Dense Autoencoder         | 7                 |
| VAE                       | 6                 |
| LNN                       | 7                 |
| Matrix Profile            | 5                 |
| GANF                      | 5                 |
| Azure                     | 0                 |

You can find the scores of each pipeline on every signal recorded in the details Google Sheets document. The summarized results can also be browsed in the following summary Google Sheets document.

Resources

Additional resources that might be of interest:

Citation

If you use AER for your research, please consider citing the following paper:

Lawrence Wong, Dongyu Liu, Laure Berti-Equille, Sarah Alnegheimish, Kalyan Veeramachaneni. AER: Auto-Encoder with Regression for Time Series Anomaly Detection.

@inproceedings{wong2022aer,
  title={AER: Auto-Encoder with Regression for Time Series Anomaly Detection},
  author={Wong, Lawrence and Liu, Dongyu and Berti-Equille, Laure and Alnegheimish, Sarah and Veeramachaneni, Kalyan},
  booktitle={2022 IEEE International Conference on Big Data (IEEE BigData)},
  pages={1152-1161},
  doi={10.1109/BigData55660.2022.10020857},
  organization={IEEE},
  year={2022}
}

If you use TadGAN for your research, please consider citing the following paper:

Alexander Geiger, Dongyu Liu, Sarah Alnegheimish, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. TadGAN - Time Series Anomaly Detection Using Generative Adversarial Networks.

@inproceedings{geiger2020tadgan,
  title={TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks},
  author={Geiger, Alexander and Liu, Dongyu and Alnegheimish, Sarah and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
  booktitle={2020 IEEE International Conference on Big Data (IEEE BigData)},
  pages={33-43},
  doi={10.1109/BigData50022.2020.9378139},
  organization={IEEE},
  year={2020}
}

If you use Orion, which is part of the Sintel ecosystem, for your research, please consider citing the following paper:

Sarah Alnegheimish, Dongyu Liu, Carles Sala, Laure Berti-Equille, Kalyan Veeramachaneni. Sintel: A Machine Learning Framework to Extract Insights from Signals.

@inproceedings{alnegheimish2022sintel,
  title={Sintel: A Machine Learning Framework to Extract Insights from Signals},
  author={Alnegheimish, Sarah and Liu, Dongyu and Sala, Carles and Berti-Equille, Laure and Veeramachaneni, Kalyan},  
  booktitle={Proceedings of the 2022 International Conference on Management of Data},
  pages={1855–1865},
  numpages={11},
  publisher={Association for Computing Machinery},
  doi={10.1145/3514221.3517910},
  series={SIGMOD '22},
  year={2022}
}

Orion's People

Contributors

alexandergeiger, ban2aru, csala, dailab-bot, dyuliu, hector-hedb12, hramir, kronerte, kveerama, lcwong0928, manuelalvarezc, micahjsmith, pvk-developer, sarahmish


Orion's Issues

Intermediate outputs of pipeline

Description

We want to store intermediate outputs of the pipeline. The user might want to store more information about the datarun besides just the found anomalies.

Approach

We add another collection, IO, to the DB, which has the fields datarun and io. datarun is a reference to the datarun that created the intermediate output, and io is a field that stores a dictionary.
A user could then use the pipeline JSON to specify what to store, following the specifications introduced to MLBlocks here.

Once that MLBlocks feature is available, we can make use of the intermediate outputs and store them as a dictionary, using primitive.variable_name as the key.

The anomalies are stored in the events collection just like before; therefore, they should be returned normally and not as intermediate outputs.
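
A minimal sketch of what the proposed collection could look like, assuming the MongoEngine-based schema style used elsewhere in Orion (class and field names are illustrative, not final):

from mongoengine import Document, fields


class IO(Document):
    # reference to the datarun that produced the intermediate output
    datarun = fields.ReferenceField('Datarun')
    # dictionary of intermediate outputs, keyed by ``primitive.variable_name``
    io = fields.DictField()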

Add a TimeSeriesAnomalyDetector

There will be a single class in the estimators module. It will contain all the logic required to fit, score, tune and obtain predictions from a pipeline, and it will expose the following methods (a rough skeleton follows the list):

  • __init__: The class constructor. It will receive the pipeline in the MLBlocks format, along with the cross-validation configuration and persistence options.

  • tune: Will use BTB and the cross-validation and scorer from the constructor to find the best hyperparameters for the given template.

  • fit: Will fit a pipeline using the hyperparameters found by the tune method or the ones given as an argument to the init method.

  • predict: Will use the pipeline to make predictions. It will raise an exception if the pipeline has not been fitted.
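
A rough skeleton of how such a class could look, assuming MLBlocks pipelines; the constructor arguments and tuning details are placeholders, not a final design:

from mlblocks import MLPipeline


class TimeSeriesAnomalyDetector:

    def __init__(self, pipeline, cv_splits=5, scorer=None, hyperparameters=None):
        # pipeline is an MLBlocks specification; cv_splits and scorer stand in
        # for the cross-validation configuration mentioned above
        self._pipeline_spec = pipeline
        self._cv_splits = cv_splits
        self._scorer = scorer
        self._hyperparameters = hyperparameters
        self._pipeline = None

    def tune(self, data):
        """Use BTB with the cross-validation and scorer from the constructor
        to find the best hyperparameters for the given template."""
        raise NotImplementedError

    def fit(self, data):
        """Fit a pipeline using the tuned or given hyperparameters."""
        self._pipeline = MLPipeline(self._pipeline_spec)
        if self._hyperparameters:
            self._pipeline.set_hyperparameters(self._hyperparameters)
        self._pipeline.fit(data)

    def predict(self, data):
        """Use the fitted pipeline to make predictions."""
        if self._pipeline is None:
            raise ValueError('The pipeline has not been fitted yet.')
        return self._pipeline.predict(data)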

User Interface

A user interface class should be created (named Orion?) that provides the user-level functionality of the project (a hypothetical usage sketch follows the list).

  1. (Previous step): The user prepares primitives and a pipeline that uses them (JSON).
  2. The user creates the Orion class passing:
    • arguments needed to load the data
    • the pipeline to be used (JSON path?)
    • the DB connection details (or DB object)
  3. (optional) The user calls the tune method to tune the pipeline on the data.
  4. The user calls the evaluate method to compute a score for the pipeline (default or tuned).
  5. The user calls the save method to store the pipeline in the DB with all the required details.
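
A hypothetical usage flow matching the steps above; argument names and values are placeholders, not the final API:

# 2. create the Orion instance with the data, pipeline and DB details
orion = Orion(
    data='path/to/signal.csv',
    pipeline='path/to/pipeline.json',
    database='mongodb://localhost:27017/orion',
)

orion.tune()              # 3. (optional) tune the pipeline on the data
score = orion.evaluate()  # 4. compute a score for the pipeline
orion.save()              # 5. store the pipeline in the DB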

Add the S3 location explicitly

At the moment, the command orion add dataset {name} expects an optional positional argument location which points at a local CSV.
If the location is not given or does not point at a local CSV, the name is assumed to be the name of one of the datasets hosted in Amazon S3.

Let's make this explicit, with the following format:

orion add dataset {name} s3://{bucket_name}/path/to/the.csv

For example, the S-1 dataset could be added as:

orion add dataset S-1 s3://d3-ai-orion/S-1.csv

Optionally, a boolean flag --demo could be added so that one is allowed to execute orion add dataset S-1 --demo and skip the S3 bucket specification.

Add F0.5 score to metrics

Description

In the original paper an F0.5 score is used for evaluating the result. We might want to add that metric as well.
sklearn.metrics.fbeta_score with beta=0.5 should achieve that.
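
For instance (labels here are illustrative point-wise anomaly flags, not Orion output):

from sklearn.metrics import fbeta_score

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

# beta=0.5 weighs precision more heavily than recall
score = fbeta_score(y_true, y_pred, beta=0.5)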

Threshold for unusually low errors

In the original implementation another threshold is used to detect unusually low errors.
The intuition is that unusually low errors are anomalous as well, because the prediction is more accurate than average. This is done by 'flipping' the errors 'around the mean' and applying the existing threshold function again.
However, this improves the score only slightly, so we could add a boolean parameter to the find_anomalies primitive that controls whether to use the low-error threshold, making it optional.
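
A minimal sketch of the flipping idea, assuming errors is a numpy array of prediction errors and find_threshold stands in for the existing thresholding logic (both names are assumptions):

import numpy as np


def low_error_anomalies(errors, find_threshold):
    # mirror the errors around their mean so unusually low errors become
    # unusually high ones
    flipped = 2 * np.mean(errors) - errors
    # reuse the existing high-error thresholding on the flipped errors
    return find_threshold(flipped)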

Rename Score to Severity

Currently, the analyze method returns a column called score, which indicates the severity of the anomaly.
This name is also in the Database Schema, currently as a field of the Events class, and possibly elsewhere after #66.

The problem with this name is that it can be confused with a goodness-of-fit score, such as accuracy or f1, so I want to propose changing it to severity everywhere.

Larger prediction window

In the original implementation a prediction window of size 10 is used. However, only the first value of that prediction sequence is used for the error calculation.
I am not sure whether this was done intentionally (instead of using a prediction window of size 1).
The predicted values of the two methods differ a bit, but the overall score is not affected much, likely because the predictions, and thus the errors, change throughout the signal, resulting in a similar (but in general slightly higher) error curve, to which the threshold is applied. So I am not sure if we need to implement that in our pipeline at the moment.

Rolling predictions

Description

We want to be able to train and predict with our model several times in a moving manner: train a model and predict anomalies for the first n timestamps, move the window by some steps, and repeat the procedure. Therefore we also need to remove any anomalies that we find before we train on that data in the next step.

Suggestion

We will introduce a new primitive that allows specifying intervals that should be dropped from the data. Then we can use the found anomalies and exclude them from the signal while we iterate over it.

In Orion we could just specify the window_size that we want to use and modify the analyze method to iterate over the signal:

import numpy as np
import pandas as pd


def analyze(pipeline, X):

    pipeline = _load_pipeline(pipeline)

    found_intervals = []
    found_events = []

    start = 0
    training_size = 2000
    testing_size = 2000

    # slide a train/test window pair over the signal
    while start < len(X) - training_size - testing_size:
        train_window = X[start:start + training_size]

        if start + testing_size < len(X) - training_size - testing_size:
            test_window = X[start + training_size - 250:start + training_size + testing_size]
        else:
            test_window = X[start + training_size - 250:]

        # previously found anomalous intervals are excluded from training
        pipeline.fit(train_window, train_ind=True, intervals=found_intervals)
        events = pipeline.predict(test_window, train_ind=False, intervals=found_intervals)
        for event in events:
            found_events.append(event)
            found_intervals.append((event[0], event[1]))

        start = start + testing_size

    if not found_events:
        # no anomalies found: return a single zero-severity placeholder event
        found_events.append([X.iloc[0]['timestamp'], X.iloc[0]['timestamp'], 0])

    found_events = pd.DataFrame(np.vstack(found_events), columns=['start', 'end', 'score'])
    found_events['start'] = found_events['start'].astype(int)
    found_events['end'] = found_events['end'].astype(int)

    return found_events

Rename evaluation to benchmark and metrics to evaluation

Currently, the orion package is organized as follows.

orion
├── evaluation.py 
└── metrics.py

With the new metrics subpackage, we would like to turn metrics.py into an evaluation subpackage and rename evaluation.py to benchmark.py.

orion
├── benchmark.py 
└── evaluation -> used to be metrics.py
    ├── __init__.py 
    ├── common.py 
    ├── contextual.py 
    ├── point.py 
    └── utils.py

Adjust `epoch` meaning in Cyclegan primitive

The epoch variable in the cyclegan primitive refers to the number of iterations over a given batch_size. However, the typical definition of an epoch is one iteration over the entire dataset.

The current behavior will therefore vary greatly between datasets and different batch_size values.

Ideas about anomaly scores

The find_anomaly primitive outputs a list of events, each associated with an anomaly score. These scores are float values >= 0 with no upper bound.

If we had a way to use integers from 0 to 10 to indicate the severity level, that would make more sense.

Let's collect some possible solutions in this post.
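
One possible mapping, for discussion only: rescale the unbounded scores of a run to integer severity levels between 0 and 10 (the function name and scaling rule are hypothetical):

import numpy as np


def to_severity_levels(scores):
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros(len(scores), dtype=int)

    # min-max scale to [0, 1], then round to integers between 0 and 10
    scaled = (scores - scores.min()) / span
    return np.round(scaled * 10).astype(int)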

Training & Test Set

Description

We should figure out which training/test split of the data makes the most sense.
We currently use all the available data for training if the user does not specify a test_size, i.e. the model is trained with anomalies.
We might want to remove anomalies from the training set so that the model can learn the normal behavior. In the original paper the training set also does not contain anomalies.

Implement the Orion Class

This issue is to track the development of the new Orion class.

The Orion class is responsible for handling the MLBlocks Pipelines that provide the central anomaly detection functionality in Orion.

Overall, the Orion class:

  • Provides simple user-facing abstractions
    • fit/detect
    • save/load
    • evaluate
  • Hides away the interaction with other systems
    • MLBlocks Pipelines
    • Pipeline Selection and Tuning (future?)

This should be the class public interface:

class Orion:

    def __init__(self,
        pipeline: Union[str, dict, MLPipeline] = DEFAULT_PIPELINE,
        hyperparameters: dict = None):
        pass
    
    def fit(self, data: DataFrame):
        """Fit the pipeline to the given data.
        
        Args:
            data (DataFrame):
                Input data, passed as a ``pandas.DataFrame`` containing
                exactly two columns: timestamp and value.
        """
        pass
        
    def detect(self, data: DataFrame,
               visualization: bool = True) -> DataFrame:
        """Detect anomalies in the given data..
        
        If ``visualization=True``, also return the visualization
        outputs from the MLPipeline object.
        
        Args:
            data (DataFrame):
                Input data, passed as a ``pandas.DataFrame`` containing
                exactly two columns: timestamp and value.
            visualization (bool):
                If ``True``, also capture the ``visualization`` named
                output from the ``MLPipeline`` and return it as a second
                output.
        
        Returns:
            DataFrame or tuple:
                If visualization is ``False``, it returns the events
                DataFrame. If visualization is ``True``, it returns a
                tuple containing the events DataFrame followed by the
                visualization outputs dict.
        """
        pass
        
    def fit_detect(self, data: DataFrame,
                   visualization: bool = True) -> DataFrame:
        """Fit the pipeline to the data and detect anomalies.
        
        This method is functionally equivalent to calling `fit(data)`
        and later on `detect(data)` but with the difference that
        here the `MLPipeline` is called only once, using its `fit`
        method, and the output is directly captured without having
        to execute the whole pipeline again during the `predict` phase.
        
        If ``visualization=True``, also return the visualization
        outputs from the MLPipeline object.
        
        Args:
            data (DataFrame):
                Input data, passed as a ``pandas.DataFrame`` containing
                exactly two columns: timestamp and value.
            visualization (bool):
                If ``True``, also capture the ``visualization`` named
                output from the ``MLPipeline`` and return it as a second
                output.
        
        Returns:
            DataFrame or tuple:
                If visualization is ``False``, it returns the events
                DataFrame. If visualization is ``True``, it returns a
                tuple containing the events DataFrame followed by the
                visualization outputs dict.
        """
        pass
    
    def save(self, path):
        """Save this object using pickle.
        
        Args:
            path (str):
                Path to the file where the serialization of
                this object will be stored.
        """
        pass
    
    @classmethod
    def load(cls, path) -> 'Orion':
        """Load an Orion instance from a pickle file.
        
        Args:
            path (str):
                Path to the file where the instance has been
                previously serialized.
        """
        pass
    
    def evaluate(self, data: DataFrame, truth: DataFrame,
                 metrics: List[str] = DEFAULT_METRICS) -> Series:
        """Evaluate the performance against a ground truth.
        
        Args:
            data (DataFrame):
                Input data, passed as a ``pandas.DataFrame`` containing
                exactly two columns: timestamp and value.
            truth (DataFrame):
                Ground truth passed as a ``pandas.DataFrame`` containing
                two columns: start and stop.
            metrics (list):
                List of metrics to use, passed as a list of strings.
                If not given, it defaults to all the Orion metrics.
        
        Returns:
            Series:
                ``pandas.Series`` containing one element for each
                metric applied, with the metric name as index.
        """
        pass

Implement new functional interface

Once the Orion Class in #79 is implemented, add a functional interface that allows using Orion in as few steps as possible and hides away some of the irrelevant steps.

The API should be implemented as three functions:

  • fit_pipeline: Learn an Orion pipeline and save it.
  • detect_anomalies: Analyze a signal to detect anomalies. Optionally learn a pipeline.
  • evaluate_pipeline: Evaluate the performance of a pipeline against a list of known anomalies.
def fit_pipeline(
    data: Union[str, DataFrame] = None,
    pipeline: Union[str, Pipeline, dict] = None,
    hyperparameters: Union[str, dict] = None,
    save_path: str = None) -> Orion:
    """Fit an Orion pipeline to the data.

    The pipeline can be passed as:
        * An ``str`` with a path to a JSON file.
        * An ``str`` with the name of a registered Orion pipeline.
        * An ``MLPipeline`` instance.
        * A ``dict`` with an ``MLPipeline`` specification.

    If no pipeline is passed, the default Orion pipeline is used.

    Args:
        data (str or DataFrame):
            Data to which the pipeline should be fitted.
            It can be passed as a path to a CSV file or as a DataFrame.
        pipeline (str or Pipeline or dict):
            Pipeline to use. It can be passed as:
                * An ``str`` with a path to a JSON file.
                * An ``str`` with the name of a registered pipeline.
                * An ``MLPipeline`` instance.
                * A ``dict`` with an ``MLPipeline`` specification.
        hyperparameters (str or dict):
            Hyperparameters to set to the pipeline. It can be passed as a
            hyperparameters ``dict`` in the ``mlblocks`` format or as a
            path to the corresponding JSON file. Defaults to
            ``None``.
        save_path (str):
            Path to the file where the fitted pipeline will be stored
            using ``pickle``. If not given, the Orion pipeline is
            returned. Defaults to ``None``.
    """
    pass
def detect_anomalies(
    data: Union[str, DataFrame] = None,
    pipeline: Union[str, Pipeline, dict] = None,
    hyperparameters: Union[str, dict] = None,
    train_data: Union[str, DataFrame] = None) -> DataFrame:
    """Detect anomalies on timeseries data.

    The anomalies are detected using an Orion pipeline which can
    be passed as:
        * An ``str`` with a path to a JSON file.
        * An ``str`` with the path to a pickle file.
        * An ``str`` with the name of a registered Orion pipeline.
        * An ``MLPipeline`` instance.
        * A ``dict`` with an ``MLPipeline`` specification.

    If no pipeline is passed, the default Orion pipeline is used.

    Optionally, separate training data can be passed to fit
    the pipeline before using it to detect anomalies.

    Args:
        data (str or DataFrame):
            Data to analyze searching for anomalies.
            It can be passed as a path to a CSV file or as a DataFrame.
        pipeline (str or Pipeline or dict):
            Pipeline to use. It can be passed as:
                * An ``str`` with a path to a JSON file.
                * An ``str`` with the name of a registered pipeline.
                * An ``str`` with the path to a pickle file.
                * An ``MLPipeline`` instance.
                * A ``dict`` with an ``MLPipeline`` specification.
        hyperparameters (str or dict):
            Hyperparameters to set to the pipeline. It can be passed as a
            hyperparameters ``dict`` in the ``mlblocks`` format or as a
            path to the corresponding JSON file. Defaults to
            ``None``.
        train_data (str or DataFrame):
            Data to which the pipeline should be fitted.
            It can be passed as a path to a CSV file or as a DataFrame.
            If not given, the pipeline is used without fitting it first.
    """
    pass
def evaluate_pipeline(
    data: Union[str, DataFrame] = None,
    truth: Union[str, DataFrame] = None,
    pipeline: Union[str, dict, MLPipeline] = None,
    hyperparameters: Union[str, dict] = None,
    metrics: List[Union[callable, str]] = None,
    train_data: Union[str, DataFrame] = None) -> DataFrame:
    """Evaluate the performance of a pipeline.

    The pipeline is evaluated by executing it on a signal
    for which anomalies are known and then applying one or
    more metrics to it to compute scores.
    
    The pipeline can be passed as:
        * An ``str`` with a path to a JSON file.
        * An ``str`` with the path to a pickle file.
        * An ``str`` with the name of a registered Orion pipeline.
        * An ``MLPipeline`` instance.
        * A ``dict`` with an ``MLPipeline`` specification.

    If the pipeline is not fitted, separate training data can be
    passed to fit the pipeline before using it to detect anomalies.

    Args:
        data (str or DataFrame):
            Data to analyze searching for anomalies.
            It can be passed as a path to a CSV file or as a DataFrame.
        truth (str or DataFrame):
            Table of known anomalies to use as the ground truth for
            scoring. It can be passed as a path to a CSV file or as a
            DataFrame.
        pipeline (str or Pipeline or dict):
            Pipeline to use. It can be passed as:
                * An ``str`` with a path to a JSON file.
                * An ``str`` with the name of a registered pipeline.
                * An ``str`` with the path to a pickle file.
                * An ``MLPipeline`` instance.
                * A ``dict`` with an ``MLPipeline`` specification.
        hyperparameters (str or dict):
            Hyperparameters to set to the pipeline. It can be passed as
            a hyperparameters ``dict`` in the ``mlblocks`` format or as
            a path to the corresponding JSON file. Defaults to ``None``.
        metrics (list[str]):
            List of metrics to use. If not passed, all the Orion metrics
            are applied.
        train_data (str or DataFrame):
            Data to which the pipeline should be fitted.
            It can be passed as a path to a CSV file or as a DataFrame.
            If not given, the pipeline is used without fitting it first.
    """
    pass

Add pipeline scoring command

Add a command to evaluate one or more pipelines using a list (or all) of demo signals.

Interface should be like this:

orion evaluate [-m <metric_1> [-m <metric_2>]..] [-s <signal_1> [-s <signal_2>]...] <path/to/pipeline_1.json> [<path/to/pipeline_2.json>]...
  • Base Command: orion evaluate
  • -m - optional, multiple: metric(s) to use. If not given use all of them.
  • -s - optional, multiple: signal(s) to use. If not given, use all of them.
  • pipeline(s): one or more paths to pipelines.

The output should be an ASCII table with this exact format (valid md format):

|  pipeline  |  metric_1 |  metric_2 | ... |
|------------|-----------|-----------|-----|
| pipeline_1 | score_1_1 | score_1_2 | ... |
| pipeline_2 | score_2_1 | score_2_2 | ... |
|     ...    |    ...    |    ...    | ... |

Where score_i_j is the average of the metric_j score obtained by pipeline_i across all the signals used for the evaluation.

The ability to identify local anomalies

https://github.com/D3-AI/Orion/blob/7ed2956f7f1f7ee540c6e1137e0c02a6b2d0ea58/orion/pipelines/lstm_dynamic_threshold.json#L9

In this lstm pipeline, the function "find_anomalies" takes all errors into consideration to find a global threshold. This limits the model's capability to identify local anomalies.

We can refer to the method introduced in the article "Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding", where h is specified as a hyperparameter: the number of historical error values used to evaluate current errors.
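
A rough sketch of the idea, assuming errors is a 1-D numpy array of smoothed errors and h is the number of historical values used to judge the current one (the threshold rule shown is a simplification, not the paper's exact method):

import numpy as np


def local_anomalies(errors, h, z=4):
    anomalous = []
    for i in range(h, len(errors)):
        history = errors[i - h:i]
        # flag the current error if it deviates strongly from recent history
        if errors[i] > history.mean() + z * history.std():
            anomalous.append(i)

    return anomalous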

orion.analysis.analyze crashes when no event is found

  • Orion version: 0.1.0
  • Python version: 3.6
  • Operating System: all

Description

The problem happens when running the CSV analysis example notebook. At line 38 of orion.analysis.analyze, if events is empty, pd.DataFrame raises an error.

File encoding/decoding issues about `README.md` and `HISTORY.md`

  • Operating System: Ubuntu 16

Description

Errors are encountered when running make install-develop due to file encoding/decoding issues.

Error message

pip install -e .[dev]
Obtaining file:///home/dongyu/apps/orion-test
    ERROR: Command errored out with exit status 1:
     command: /home/dongyu/.conda/envs/orion-test/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/dongyu/apps/orion-test/setup.py'"'"'; __file__='"'"'/home/dongyu/apps/orion-test/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info
         cwd: /home/dongyu/apps/orion-test/
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/dongyu/apps/orion-test/setup.py", line 8, in <module>
        readme = readme_file.read()
      File "/home/dongyu/.conda/envs/orion-test/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 115: ordinal not in range(128)
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Makefile:82: recipe for target 'install-develop' failed
make: *** [install-develop] Error 1

Solution

# setup.py

try:
    with open('README.md', encoding='utf-8') as readme_file:
        readme = readme_file.read()
except IOError:
    readme = ''

try:
    with open('HISTORY.md', encoding='utf-8') as history_file:
        history = history_file.read()
except IOError:
    history = ''

Changing database schema

@kveerama and I discussed some changes of the DB schema:

  • add name and description fields to Datarun collection
  • add an Experiment collection, which consists of multiple Dataruns. (We might want to restrict it such that one Experiment can only consist of multiple Dataruns with the same Dataset)

The idea is that the first screen a user would see is the Experiments overview. If an Experiment is clicked, the different Dataruns belonging to the Experiment are shown. Thus, Experiments are a way of structuring multiple Dataruns. A Datarun consists of a Dataset and a Pipeline. When a Datarun is clicked, the Dataset itself and all Events and Comments belonging to the Datarun are displayed.

If anyone has additional thoughts on that, please feel free to comment.

Overlapping Thresholds

We should include overlapping thresholds in the find_anomalies primitive. This means that the windows for which the thresholds are calculated overlap, i.e. each point has multiple thresholds that could declare it an anomaly. This was done in the original implementation by NASA. I have already done this in my local pipeline and I will open a PR to include it in the standard MLPrimitives version, too.

Additional input dimensions

In the original implementation more than one input dimension is used.
We are currently only using the telemetry value itself, but there are some additional command dimensions in the raw data.
However, the scores don't seem to be affected much by those additional dimensions.
We might just want to consider them when actually comparing the results of our pipeline to NASA's results.

Two bugs when saving signalrun if there is no event detected

Error message:

INFO:orion.runner:Processing pipeline arima predictions on signal 11125
ERROR:orion.db.schema:Error storing signalrun 5ef409e3c2be97adf0ee798c events
Traceback (most recent call last):
  File "/home/dongyu/apps/orion/orion/db/schema.py", line 242, in end
    for start_time, stop_time, severity in events:
TypeError: 'NoneType' object is not iterable
ERROR:orion.runner:Datarun 5ef409e3c2be97adf0ee798b crashed
Traceback (most recent call last):
  File "/home/dongyu/apps/orion/orion/db/schema.py", line 242, in end
    for start_time, stop_time, severity in events:
TypeError: 'NoneType' object is not iterable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dongyu/apps/orion/orion/runner.py", line 99, in start_datarun
    start_signalrun(orex, datarun, signal)
  File "/home/dongyu/apps/orion/orion/runner.py", line 73, in start_signalrun
    signalrun.end(status, events)
  File "/home/dongyu/apps/orion/orion/db/schema.py", line 253, in end
    status = self.STATUS_ERROR
AttributeError: 'Signalrun' object has no attribute 'STATUS_ERROR'
INFO:orion.runner:Datarun 5ef40a5bc2be97adf0ee798d started
INFO:orion.runner:Signalrun 5ef40a5bc2be97adf0ee798e started
INFO:orion.runner:Running pipeline lstm on signal 11125

BUG1:
TypeError: 'NoneType' object is not iterable
When there is no event detected, the variable events is expected to be an empty list rather than None.

BUG2:
'Signalrun' object has no attribute 'STATUS_ERROR'
Go to base.py, lines 205-208:

STATUS_PENDING = 'PENDING'
STATUS_RUNNING = 'RUNNING'
STATUS_SUCCESS = 'SUCCESS'
STATUS_ERRORED = 'ERRORED'

However, in schema.py, line 253:

status = self.STATUS_ERROR

The constant name is used incorrectly.
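
A sketch of possible fixes for both bugs, based on the traceback above (the surrounding schema.py code is paraphrased, not exact):

# BUG1: normalize the events result before iterating, so a signalrun with no
# detected events behaves like an empty list instead of None
events = None  # what gets passed when nothing is detected
for start_time, stop_time, severity in (events or []):
    print(start_time, stop_time, severity)

# BUG2: the base class defines STATUS_ERRORED (not STATUS_ERROR), so the
# error branch should read `status = self.STATUS_ERRORED`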

Need validations during data preprocessing stage

When running Orion on certain datasets, we may encounter the following errors.

The start_time or stop_time of a signal can be a strange negative value.

This error results from invalid timestamp values in the raw data file. I think we have to do format validation when we load signals from certain locations.

This bug has a very harmful impact: if the start_time is a negative value and, say, the interval we use to aggregate data is 6 hours, we will create a huge number of "unnecessary" intervals starting from the wrong start_time.
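
A hedged sketch of the kind of validation suggested, assuming the usual two-column signal format (timestamp, value); the function name and checks are illustrative:

import pandas as pd


def validate_signal(df: pd.DataFrame) -> pd.DataFrame:
    # reject signals with negative or unsorted timestamps before aggregation
    if (df['timestamp'] < 0).any():
        raise ValueError('Signal contains negative timestamps.')

    if not df['timestamp'].is_monotonic_increasing:
        raise ValueError('Signal timestamps are not sorted in increasing order.')

    return df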

Make holdout optional in the evaluation

The current pipeline evaluation system does not support choosing whether to split the data into train and test partitions or to fit and predict on the same data.

This could be added as an optional argument for both the Python interface and the CLI.

Add better exception messages

There are two kinds of exceptions when a dataset is not found:

  • AttributeError: 'NoneType' object has no attribute 'data_location' in the load_dataset function.
  • HTTP 404 from read_csv (pandas) in the load_nasa_signal function.

Fix bottle neck of `score_anomaly` in Cyclegan primitive

In the current Cyclegan primitive implementation, the following piece of code causes a huge bottleneck when computing the anomaly score.

for c in critic:
   critic_extended = critic_extended + np.repeat(c, y_hat.shape[1]).tolist()

@dyuliu discovered that this is caused by concatenating with +; the proposed fix is to use a simple list extension with extend.
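
The proposed fix, sketched on the same variables as the snippet above (critic and y_hat come from the surrounding primitive code):

critic_extended = []
for c in critic:
    # extend mutates the list in place and avoids re-copying it on every step
    critic_extended.extend(np.repeat(c, y_hat.shape[1]).tolist())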

Discuss a good way to store signals and events in different granularities

We have to use the signal processing pipeline to produce signals at different aggregation levels.

One of the requirements from end-users is to look at the signal at different granularities, such as 6 minutes, 30 minutes, 1 hour, 6 hours, etc.

(1) How do we store the signals in the database to allow efficient data exploration? We have to make a tradeoff between time efficiency and space efficiency.
- Solution 1: store only the most fine-grained signals so that we can use this data to infer the ones at a coarser level.
- Solution 2: store signals at every granularity ranging from 6 minutes to 6 hours.

(2) One signal can be processed at different aggregation levels, and then anomalous events for every aggregation level could be generated in different shapes. We need to think of a good way to organize these events that are related to the same signal but at different granularities.

Issues with the original code/paper

There are a few things in the original NASA project/paper that are a bit confusing, so I will list them here for your information and future reference:

  • The labels for the anomalous regions of the channels in the labeled_anomalies.csv file seem to be off by some number (presumably 260). In the original implementation this offset is cancelled out again, but the anomaly regions that are reported in the file are not correct per se and we have to take care of that in our implementation. This should already be reflected in the anomalies data set we have in our AWS bucket.
  • The F0.5 scores reported in the paper seem to be calculated incorrectly (their reported Precision and Recall scores should give a higher F0.5 score)
  • If we use the original code and retrain the models, we do not reach the same scores, but we are somewhere close (reported true F0.5 score is 0.859, while we reach 0.802. If the pre-trained models are taken, the score is quite similar to the paper).

A problem of the current DB schema of "Experiment"

Description

The DB schema of "Experiment" does not allow retrieving detailed information about the running results, because we don't know which dataruns are related to the experiment.

Solution

Current schema:

class Experiment(Document, MongoUtils):
    project = fields.StringField()
    pipeline = fields.ReferenceField(Pipeline)
    dataset = fields.ReferenceField(Dataset)
    created_by = fields.StringField()

Proposed schema:

class Experiment(Document, MongoUtils):
    project = fields.StringField(required=True)
    pipeline = fields.ReferenceField(Pipeline)
    dataset = fields.ReferenceField(Dataset)
    created_by = fields.StringField()
    name = fields.StringField(required=True)
    events = fields.IntField(required=True)
    dataruns = fields.ListField(fields.ReferenceField(Datarun))
    start_time = fields.DateTimeField(required=True)
    end_time = fields.DateTimeField()
    status = fields.StringField()

The most important attribute is dataruns. If we don't store this information, we cannot retrieve any detailed information about this experiment. We can only know which signals were run in this experiment, but nothing about the running results.

Create dummy primitives and pipeline

Create a set of simple dummy primitives that work with the expected input and generate the expected output and compile a demo pipeline that uses them.

CLI commands fail on empty databases

Some commands raise exceptions in an empty database:

  • $ orion list dataruns: KeyError: 'software_versions'
  • $ orion list events: AttributeError: 'DataFrame' object has no attribute 'event_id'
  • $ orion list comments: KeyError: 'insert_time'

dynamic scalability of TadGAN primitive based on `window_size`

In the current implementation of cyclegan, changing the window_size might not work because the units of the layers are defined independently.

For example the encoder layer is defined as:

"layers_encoder": {
                "type": "list",
                "default": [
                    {
                        "class": "keras.layers.Bidirectional",
                        "parameters": {
                            "layer": {
                                "class": "keras.layers.LSTM",
                                "parameters": {
                                    "units": 100,
                                    "return_sequences": true
                                }
                            }
                        }
                    },
                    {
                        "class": "keras.layers.Flatten",
                        "parameters": {}
                    },
                    {
                        "class": "keras.layers.Dense",
                        "parameters": {
                            "units": 20
                        }
                    },
                    {
                        "class": "keras.layers.Reshape",
                        "parameters": {
                            "target_shape": "encoder_reshape_shape"
                        }
                    }
                ]
            }

In this case, the units of the LSTM layer should equal the input size (window_size).

Unroll timeseries sequences based on `step_size`

  • Orion version: 0.1.1
  • Python version: 3.6

Description

At the moment, timeseries predictions are unrolled as a "shrunken" version of the timeseries when step_size > 1.

A clear example of this is demonstrated here

Assume the original timeseries length is 10100, and window_size=100.

For step_size = 1
The input matrix is (10000, 100, 1), so y_hat is also (10000, 100, 1); then we take the median of the 100 values at every time step -> (10000, 1). The final output of primitive score_anomaly is (10000, 1), indicating the error at every time step.

For step_size = 5
The input matrix will be (2000, 100, 1) -> y_hat (2000, 100, 1) -> (2000, 1) -> the final output of primitive score_anomaly is (2000, 1). The index array is also of shape (2000, 1). But in fact, we expect the output to be (10000, 1), and the recorded index array to have the same shape (10000, 1).

Proposed Solution

Alter the score_anomaly function to take step_size as an argument and unroll the timeseries to the original dimension.
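
A possible sketch of the unrolling, assuming per-window errors indexed every step_size timesteps; one simple approach is to repeat each windowed value so the output lines up with the original per-timestep index (function and variable names are illustrative):

import numpy as np


def unroll_errors(errors, step_size):
    # repeat each windowed error ``step_size`` times so the result has one
    # value per original timestep, e.g. (2000,) -> (10000,) for step_size=5
    return np.repeat(np.asarray(errors), step_size)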

Integrate with Codecov

  • Integrate with Codecov to make sure all changes from PRs improve the code coverage of tests.
  • Update the readme with the new badge.

Extend the Database Model

The MongoDB class should include the methods to interact with the database following the predefined schema:

  • datasets

    • dataset id
    • name
    • signal set
    • satellite id
    • start time
    • stop time
    • date created
    • created by
  • events

    • event id
    • datarun id
    • dataset id
    • model location
    • metrics location
    • tag
    • start time
    • stop time
    • MIT team comments
    • SES team comments
  • dataruns

    • datarun id
    • dataset id
    • mlblocks version
    • mlprimitives version
    • btb version
    • copulas version
    • settings used
    • model settings
    • start time
    • end time
    • budget type
    • budget amount
    • model location
    • metrics location
    • created by

Shape matching API

As mentioned in issue #72, we would like to implement a feedback integration module in Orion which uses subsequence shape matching. Basically any suitable method could be used; therefore, we should decide how these methods can be implemented and connected through an API in order to make them interchangeable.

Since we already rely on MLPrimitives for the unsupervised anomaly detection part, it would make sense to implement the shape matching methods as primitives as well.

Some things we should discuss:

  • Inputs and outputs to these primitives, so that we can easily integrate them into the feedback component of Orion
  • Where to store these primitives (Orion or MLPrimitives)

Integrate user feedback into Orion

One significant part of Orion is the user interaction, where users can annotate signals through MTV.
We can use these annotations to improve future anomaly detections.

A simple proposal of how this workflow could look:

  • For each signal in the signalset of the datarun:
    • run pipeline on signal and find anomalies
    • get all known events that are related to the signal and have an annotation tag from the database
    • For each known event:
      • get the aggregated signal (in intermediate outputs) from the datarun where the known event was found
      • get the shape of the sequence that was marked as anomalous in the known event
      • compare this shape to the aggregated signal of the current datarun using a specified method (e.g. DTW) and check if some subsequence is significantly closer than others
      • if there is a similar sequence, add an event with source 'shape matching' and a corresponding annotation tag that is similar to the tag of the original event
      • if there is any anomaly that was found in the current datarun, which overlaps with the known event, remove it from the list of found anomalies
    • add all remaining found anomalies as an event with source 'orion'

@sarahmish came up with a first skeleton of how we could implement that in Orion:

from orion.explorer import OrionExplorer


class OrionFeedback:
    """This class manages the annotated events for a specified signal from
    MTV and incorporates them back into Orion.

    Should this be inherited from the OrionExplorer?
    """

    def execute_feedback(self, datarun, signal_id):
        """This is the main method for feedback.

        It loads the specified signal, fetches its known events, applies
        shape matching to the known events, then returns a resolved list
        of labels for the signal.

        Attributes:
            - datarun: the datarun with the annotations
            - signal_id: the specific signal in the datarun for which we are executing feedback
        """
        signal = self.get_signal(signal_id)
        known_events = self.get_known_events(datarun, signal_id)

        for event in known_events:
            matched_events = self.shape_matching(signal, event)
            if self.overlap(matched_events, known_events):
                # priority: user > orion
                #           user > shape_matching
                #           shape_matching > orion
                #
                # if same priority:
                #           user ? user
                #           shape_matching.match_score ? shape_matching.match_score
                #           keep the higher score
                #
                # or have overlap favour anomalies over normal labels generally.
                pass

        return matched_events + known_events

    def shape_matching(self, signal, segment, method="dtw"):
        """This method returns the similarity score between signal and segment.

        Attributes:
            - signal: the signal data
            - segment: the segment we want to match
            - method: the algorithm used
        """
        pass

    def overlap(self, shapes, window):
        """This method returns shapes after removing overlapping components,
        keeping the higher scored ones.

        Attributes:
            - shapes: a list of dicts, each with an ``id`` (start index) and a ``cost``
            - window: the length of each matched shape
        """
        # sort by matching cost so the best matches are kept first
        shapes = sorted(shapes, key=lambda x: x["cost"])
        no_overlap = []

        for first_shape in shapes:
            first_range = range(first_shape["id"], first_shape["id"] + window)

            flag = True
            for second_shape in no_overlap:
                second_range = range(second_shape["id"], second_shape["id"] + window)

                xs = set(first_range)
                if len(xs.intersection(second_range)) > 0:
                    flag = False

            if flag:
                no_overlap.append(first_shape)

        return no_overlap

    # helper functions
    def get_signal(self, signal):
        """Get the signal from mongodb."""
        # exists in OrionExplorer
        pass

    def get_pipeline(self, pipeline):
        """Get the pipeline from mongodb."""
        # exists in OrionExplorer
        pass

    def get_known_events(self, datarun, signal):
        """Get registered events for a particular signal in a given datarun
        from mongodb.
        """
        pass

Some points that should be discussed:

  • Should this class be a subclass of the OrionExplorer, since it requires much of the same functionality?
  • How do we handle cases where multiple users annotated a sequence with different labels?
  • Should we use raw or aggregated signals for the shape matching?
  • What methods can be used for the shape matching besides DTW? Based on user annotations, can we use supervised (and maybe online) Machine Learning methods for subsequence classification?

I would appreciate having an offline installation

  • Orion version: 0.1.0
  • Python version: 3.6
  • Operating System: centos 7

Description

We usually work on workstations with highly constrained network access.
It would be greatly appreciated to be able to install Orion without internet access.
Failing that, please provide a list of packages/repositories to install/download manually before attempting the Orion installation.

Currently trying:
make install
output:
$ make install
fatal: Not a git repository (or any of the parent directories): .git
rm -fr build/
rm -fr dist/
rm -fr .eggs/
find . -name '*.egg-info' -exec rm -fr {} +
find . -name '*.egg' -exec rm -f {} +
find . -name '*.pyc' -exec rm -f {} +
find . -name '*.pyo' -exec rm -f {} +
find . -name '*~' -exec rm -f {} +
find . -name '__pycache__' -exec rm -fr {} +
pip install . -r requirements_dev.txt
Processing /home/dcalvo/Documents/MIT/Orion-master
Obtaining mlprimitives from git+https://github.com/HDI-Project/MLPrimitives.git@9b910e8c683bd69ac55aa4ef759f696fd91a7134#egg=mlprimitives (from -r requirements_dev.txt (line 1))
Cloning https://github.com/HDI-Project/MLPrimitives.git (to revision 9b910e8c683bd69ac55aa4ef759f696fd91a7134) to ./src/mlprimitives
fatal: unable to access 'https://github.com/HDI-Project/MLPrimitives.git/': Could not resolve host: github.com; Unknown error
Command "git clone -q https://github.com/HDI-Project/MLPrimitives.git /home/dcalvo/Documents/MIT/Orion-master/src/mlprimitives" failed with error code 128 in None
make: *** [install] Error 1

Thanks!

error while running make install

  • Orion version: Latest commit ac44a13
  • Python version: 3.6
  • Operating System: CENTOS 7

Description

I did a git pull and was following the install instructions.
make install fails because it cannot find pytest-runner.
The pytest-runner documentation (https://pypi.org/project/pytest-runner/) explicitly states that setup_requires and tests_require should be removed and the tests run differently.

What I Did

make install
rm -fr build/
rm -fr dist/
rm -fr .eggs/
find . -name '*.egg-info' -exec rm -fr {} +
find . -name '*.egg' -exec rm -f {} +
find . -name '*.pyc' -exec rm -f {} +
find . -name '*.pyo' -exec rm -f {} +
find . -name '*~' -exec rm -f {} +
find . -name '__pycache__' -exec rm -fr {} +
pip install .
Processing /home/dcalvo/ORION
    ERROR: Command errored out with exit status 1:
     command: /home/anaconda/anaconda3/envs/sedate36_dev/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-skzlmbak/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-skzlmbak/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
         cwd: /tmp/pip-req-build-skzlmbak/
    Complete output (25 lines):
    Download error on https://pypi.org/simple/pytest-runner/: [Errno -2] Name or service not known -- Some packages may not be found!
    Couldn't find index page for 'pytest-runner' (maybe misspelled?)
    Download error on https://pypi.org/simple/: [Errno -2] Name or service not known -- Some packages may not be found!
    No local packages or working download links found for pytest-runner>=2.11.1
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-skzlmbak/setup.py", line 112, in <module>
        zip_safe=False,
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/setuptools/__init__.py", line 144, in setup
        _install_setup_requires(attrs)
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/setuptools/__init__.py", line 139, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/setuptools/dist.py", line 717, in fetch_build_eggs
        replace_conflicting=True,
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/pkg_resources/__init__.py", line 782, in resolve
        replace_conflicting=replace_conflicting
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1065, in best_match
        return self.obtain(req, installer)
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1077, in obtain
        return installer(requirement)
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/setuptools/dist.py", line 784, in fetch_build_egg
        return cmd.easy_install(req)
      File "/home/anaconda/anaconda3/envs/sedate36_dev/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 673, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pytest-runner>=2.11.1')
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
make: *** [install] Error 1

I simply commented out the contents of setup_requires and tests_require.

Scoring function for intervals of size one

The known anomalies in a time series could be of length one. In that case, the current segment-based scoring function will not consider the existence of an overlap.

def _partition(expected, observed, start=None, end=None):
    edges = set()

    if start is not None:
        edges.add(start)

    if end is not None:
        edges.add(end)

    for edge in expected + observed:
        edges.update(edge)

    partitions = list()
    edges = sorted(edges)
    last = edges[0]
    for edge in edges[1:]:
        partitions.append((last, edge))
        last = edge

    expected_parts = list()
    observed_parts = list()
    weights = list()
    for part in partitions:
        weights.append(part[1] - part[0] + 1)
        expected_parts.append(_any_overlap(part, expected))
        observed_parts.append(_any_overlap(part, observed))

    return expected_parts, observed_parts, weights

If we look at the _partition function, we notice that edges is a set and will only include unique edges.
But an example of a perfectly detected anomaly is as follows:

known_anomalies = 
| start | end |
|-------|-----|
| 3     | 3   |

observed anomalies = 
| start | end |
|-------|-----|
| 3     | 3   |

Now edges = {3}, and therefore, the result of this function will be an empty partition.
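
Reproducing the issue with the _partition function above:

expected = [(3, 3)]
observed = [(3, 3)]

parts, obs, weights = _partition(expected, observed)
print(parts, obs, weights)  # [] [] [] -- the perfect match never gets scored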

Add anomalous area around an anomaly

We should mark a certain area around found anomalies as anomalous as well. This will improve our pruning process afterwards, as it should not reclassify as many anomalies as normal again. I have already done this in my local implementation and will open a PR to MLPrimitives to include it in the find_anomalies primitive.
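
A rough sketch of the padding idea, assuming anomalies are given as (start, end) index tuples; the function name and the pad parameter are illustrative:

def pad_anomalies(anomalies, pad, signal_length):
    """Mark ``pad`` extra points on each side of every anomaly as anomalous."""
    return [
        (max(0, start - pad), min(signal_length - 1, end + pad))
        for start, end in anomalies
    ]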
