
anomalydetector's Introduction

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Users can run SR by referring to the sample here:

https://github.com/microsoft/anomalydetector/blob/master/main.py. This sample runs SR only; for SR-CNN, please refer to the section below. Both SR and SR-CNN use the same evaluation in evaluate.py.

The SR-CNN project consists of three major parts.
1. generate_data.py preprocesses the data: the original continuous time series are split according to the window size, and artificial outliers are injected in proportion.
python generate_data.py --data <dataset>
where dataset is the file name in the data folder. If you want to change the default config, you can use the command-line args:
python generate_data.py --data <dataset> --window 256 --step 128
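The window/step splitting described above can be sketched as follows (split_windows is an illustrative helper, not the repo's code; the window and step parameters mirror the flags above):

```python
import numpy as np

def split_windows(series, window=256, step=128):
    """Split a 1-D series into overlapping windows of length `window`,
    advancing by `step` points each time (illustrative sketch)."""
    windows = []
    for start in range(0, len(series) - window + 1, step):
        windows.append(series[start:start + window])
    return np.array(windows)

ts = np.arange(1024, dtype=float)
w = split_windows(ts, window=256, step=128)
print(w.shape)  # (7, 256): seven half-overlapping windows
```

Artificial outliers would then be injected into a proportion of these windows before training.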
2. train.py is the network training module of SR-CNN. The SR transform is applied to each time series before training.
python train.py --data <dataset>
3. evalue.py is the evaluation module. As mentioned in our paper,
we evaluate our model from three aspects: accuracy, efficiency, and generality. We use precision, recall, and F1-score to indicate the accuracy of our model. In real applications, human operators do not care about point-wise metrics. It is acceptable for an algorithm to trigger an alert for any point in a contiguous anomaly segment, as long as the delay is not too long. Thus, we adopt the evaluation strategy following [23]. We mark the whole segment of continuous anomalies as a single positive sample, which means that no matter how many anomalies are detected within the segment, only one effective detection is counted. If any point in an anomaly segment can be detected by the algorithm, and the delay of this point is no more than k from the start point of the segment, we say the segment is detected correctly. In that case, all points in the segment are treated as correct; points outside the anomaly segments are treated as normal.
We set different delays to verify whether a whole section of anomalies can be detected in time. For example, when delay = 7, an entire anomaly segment is considered successfully detected if the detector can issue an alarm at any of its first 7 points; otherwise, the segment is considered undetected.
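The delay-based adjustment described above can be sketched as follows (an illustrative reimplementation, not the repo's evalue.py; the function and variable names are ours):

```python
def adjust_by_delay(labels, preds, delay=7):
    """An anomaly segment counts as detected only if an alert fires within
    its first `delay` points; the predictions for the whole segment are
    then rewritten as all-correct or all-missed (illustrative sketch)."""
    preds = list(preds)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] == 1:
            j = i
            while j < n and labels[j] == 1:
                j += 1                      # [i, j) is one anomaly segment
            hit = any(preds[k] for k in range(i, min(i + delay, j)))
            for k in range(i, j):
                preds[k] = 1 if hit else 0  # whole segment detected or missed
            i = j
        else:
            i += 1
    return preds

labels = [0, 1, 1, 1, 1, 0, 0]
print(adjust_by_delay(labels, [0, 0, 1, 0, 0, 0, 0], delay=2))  # alert within delay: segment detected
print(adjust_by_delay(labels, [0, 0, 0, 0, 1, 0, 0], delay=2))  # alert too late: segment missed
```

Precision, recall, and F1 are then computed on the adjusted predictions.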
Run the code:
python evalue.py --data <dataset>

anomalydetector's People

Contributors

amazonnnn, congruihuang, conhua, guinao, likebupt, microsoft-github-policy-service[bot], microsoftopensource, msftgits, sccc19


anomalydetector's Issues

Some questions about SR-CNN

Hello, and thank you for your work! (Asked originally in Chinese, since the authors appear to be Chinese.)
1. The paper mentions that both SR and SR-CNN achieve very high scores on the KPI competition dataset. Are those scores computed from the labels provided with the original dataset, or are they obtained by training on labels produced via the anomaly-injection approach in the code?
2. What exactly does the following code in generate_data.py do? The paper does not seem to mention it:

if (self.win_siz - 6) not in ids:
    self.control += np.random.random()
else:
    self.control = 0
if self.control > 100:
    ids[0] = self.win_siz - 6
    self.control = 0

Thanks!

Not true streaming pipeline in SR implementation?

After reading the code, I found that SR outputs the result for one whole window at a time (via batch_size), instead of outputting only the result for the latest point. It seems this is not a true streaming method. Shouldn't we detect and output the result for the latest point at each timestep? I think detection over non-overlapping windows is not really reasonable, especially when comparing efficiency... Thanks.
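For reference, the per-point streaming scheme the issue describes could look roughly like this (score_fn, window, and threshold are placeholders of our own, not the repo's API):

```python
from collections import deque

def stream_detect(points, score_fn, window=64, threshold=0.3):
    """Keep a sliding window and, at each timestep, score the window but
    emit an alert decision only for the newest point (illustrative sketch,
    not the repo's implementation)."""
    buf = deque(maxlen=window)
    alerts = []
    for x in points:
        buf.append(x)
        if len(buf) < window:
            alerts.append(False)          # warm-up: not enough history yet
        else:
            scores = score_fn(list(buf))  # one score per point in the window
            alerts.append(scores[-1] > threshold)
    return alerts

# Dummy scorer: flag a point if it deviates strongly from the window mean.
def score_fn(w):
    mean = sum(w) / len(w)
    return [abs(v - mean) / (abs(mean) + 1e-8) for v in w]

alerts = stream_detect([1.0] * 64 + [100.0], score_fn, window=64)
print(alerts[-1])  # True: the spike is flagged at the step it arrives
```

Each point is scored exactly once, at the step it arrives, which is the behavior the issue contrasts with non-overlapping windows.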

train.py

When I run the command "python train.py --data ...", I get the following error:
File "train.py", line 27, in <module>
from srcnn.utils import *
ModuleNotFoundError: No module named 'srcnn'
I tried to solve this by marking the "srcnn" directory as a source root, but the problem persists.
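One common workaround for this kind of import error (an assumption based on the traceback, not an official fix) is to run the script from the repository root so that the srcnn package is on the import path:

```shell
# From the repository root, make the srcnn package importable:
PYTHONPATH=. python srcnn/train.py --data <dataset>
# or, if srcnn is a proper package (has __init__.py), run it as a module:
python -m srcnn.train --data <dataset>
```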

Issue with "batch_size" in main.py

I find your approach very exciting and wanted to try out your code.
As a basis I use Anaconda on a Win 10 PC.
After the installation I had some trouble (_anomaly_kernel_cython.c not recognized), which could be solved by compiling the Cython module (see here).

please use the following command to compile the Cython module.
python setup.py build_ext --inplace

Originally posted by @guinao in #19 (comment)

When trying to run the main.py script, the following error message appears

File "main.py", line 17, in <module>
    detect_anomaly(sample, THRESHOLD, MAG_WINDOW, SCORE_WINDOW, 99, DetectMode.anomaly_only)
File "main.py", line 9, in detect_anomaly
    sensitivity=sensitivity, detect_mode=detect_mode)
TypeError: __init__() missing 1 required positional argument: 'batch_size'

Can anyone give me a hint what value the variable batch_size should have?

Many thanks in advance

question about the data generation

Thank you for sharing the code. I was trying to understand the code around the synthetic data generation.

Can you kindly elaborate on what these lines are doing:

if (self.win_siz - 6) not in ids:
    self.control += np.random.random()
else:
    self.control = 0
if self.control > 100:
    ids[0] = self.win_siz - 6
    self.control = 0

In particular, I am puzzled by what they achieve and by the role of the control variable. Could you kindly elaborate on what is happening in this code?

Reproducibility Yahoo results

Hello,

Was anyone able to reproduce the Yahoo results from the paper? The results I am able to produce are not even close to the reported performance.

@authors, it would be great if you could provide the respective hyperparameter settings.

All the best,
Christian

Not able to generate the data using sample.csv

Earlier, when I ran the following command (python3 srcnn/generate_data.py samples/sample.csv), it processed the data as desired and the entire code worked without issue. But currently I get the following error, and I am not sure whether this is a bug or something I messed up:

python3 srcnn/generate_data.py samples/sample.csv
Traceback (most recent call last):
File "srcnn/generate_data.py", line 28, in
from srcnn.utils import *
ModuleNotFoundError: No module named 'srcnn'

Kindly advise. And thanks in advance 😄

Question about saliency map

Hi,
The idea of constructing a saliency map using the FFT is interesting, but I have some questions. As we know, time-series anomaly detection usually models the normal data's shape, value, or distribution and measures the distance of abnormal data from it, which can be viewed as a reconstruction error (e.g., PCA and VAE methods). If the time-series data is mapped to the frequency domain and sent through a filter (as shown in the code), the seasonality of the reconstructed data (IFFT) is simply removed. So why can the derived saliency map represent the anomaly score of the corresponding time-series data? And what is the concrete physical meaning behind the FFT-based saliency map?
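For readers following this discussion, the spectral-residual idea can be sketched in a few lines of NumPy (an illustrative reimplementation, not the repo's exact code; the smoothing width q and the epsilon are our assumptions):

```python
import numpy as np

def saliency_map(x, q=3):
    """Saliency map via spectral residual: subtract a local average from
    the log-amplitude spectrum, then transform back keeping the original
    phase. Smooth spectral structure (trend, seasonality) is suppressed,
    so broadband events such as spikes dominate the result."""
    fft = np.fft.fft(x)
    log_amp = np.log(np.abs(fft) + 1e-8)
    avg = np.convolve(log_amp, np.ones(q) / q, mode="same")  # local average
    residual = log_amp - avg
    # back to the time domain, keeping the original phase
    return np.abs(np.fft.ifft(np.exp(residual + 1j * np.angle(fft))))

t = np.arange(200)
x = np.sin(2 * np.pi * t / 20)
x[100] += 5.0  # injected spike
s = saliency_map(x)
print(int(np.argmax(s)))  # the spike dominates the saliency map
```

Intuitively, a seasonal component occupies a few sharp spectral peaks that the local average tracks well, while a point anomaly spreads energy across the whole spectrum, so it survives the residual and stands out after the inverse transform.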

_pickle.UnpicklingError: could not find MARK

File "evalue.py", line 121, in <module>
    total_time, results, savedscore = get_score(data_source, files, args.thres, args.missing_option)
File "evalue.py", line 74, in get_score
    tmp_data = read_pkl(f)
File "C:\anomalydetector-master\anomalydetector-master\srcnn\utils.py", line 47, in read_pkl

def read_pkl(path):
    with open(path, 'rb') as f:
        return pickle.load(f)  # this is where it throws the error

Questions about model evaluation

I have two questions regarding model evaluation:

  1. Why is the frame to be evaluated (x) multiplied by 100 during the call to the modelwork function? Isn't it affecting data normalization?
  2. Is the data supposed to be normalized before the call to sr_cnn_eval?

thank you in advance.

Data parameter for generate_data.py

Hi,
generate_data.py does not take the path of the CSV file as one of its parameters. My dataset is in srcnn/data. I get the following error:
[screenshot of the error]
How do I solve this issue?

For generality comparison, the three classes (seasonal, stable and unstable) of Yahoo dataset?

Thank you for your outstanding work!

In the paper, you mentioned "To evaluate generality, we group the time-series in Yahoo dataset into 3 major classes (for example, seasonal, stable and unstable as shown in Figure 1) manually", so could you provide this manual classification for Yahoo? People may produce different classifications based on observation, especially for seasonal and unstable time series, which matters when reproducing and comparing performance.

Thanks very much!

Questions about the usability of hyperparameters and what they mean.

These are the hyperparameters specified in the code, but there is no documentation describing how exactly they affect anomaly detection. Could you please shed some light on what each of these means for the model, especially MAG_WINDOW and SCORE_WINDOW?

MAX_RATIO = 0.25
EPS = 1e-8
THRESHOLD = 0.3
MAG_WINDOW = 3
SCORE_WINDOW = 40

Extend Series

The accompanying paper (https://arxiv.org/pdf/1906.03821.pdf) notes that the SR part of the algorithm extends the series by taking the gradient between the last observed point and the 'm' prior points.

The extend_series method of the SpectralResidual class calculates the extension to the series as:

extension = [SpectralResidual.predict_next(values[-look_ahead - 2:-1])] * extend_num

The -look_ahead - 2:-1 slice passes the 6th-to-last through 2nd-to-last values to the predict_next method. This seems to stand in contrast to formula 8 in the paper, which indicates that the average gradient used in the prediction is calculated by comparing the last observation with the prior 5 observations.

Wouldn't the predict_next method, as coded, be comparing the 2nd-to-last value with its prior five values?

It seems like values[-look_ahead - 1:] would fit the description in the paper better, by passing the last observed value and the prior 5 values to the function.
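A quick toy check of the two slices under discussion (look_ahead = 5, matching m = 5 in the paper; the values list is illustrative):

```python
values = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
look_ahead = 5

as_coded = values[-look_ahead - 2:-1]   # stops one short of the last value
proposed = values[-look_ahead - 1:]     # includes the last observed value

print(as_coded)  # [40, 50, 60, 70, 80, 90]
print(proposed)  # [50, 60, 70, 80, 90, 100]
```

As the output shows, the slice in the code excludes the most recent observation, while the proposed slice passes the last observed value together with its 5 predecessors.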

Questions about real time predictions from an incoming stream of data.

As covered in the documentation, the following usage only makes it possible to use detect() for detecting anomalies within the time series the detector was constructed with. How do we save a trained model and pass new incoming data to the detector?

def detect_anomaly(series, threshold, mag_window, score_window, sensitivity, detect_mode, batch_size):
    detector = SpectralResidual(series=series, threshold=threshold, mag_window=mag_window, score_window=score_window,
                                sensitivity=sensitivity, detect_mode=detect_mode, batch_size=batch_size)
    return detector.detect()

Support for multivariate time series

Hello, I'm studying this algorithm for my thesis and I want to extend this method to support multivariate time series as well.
Is there already a working project for this? Do you have any hints? I think a trivial starting point would be to apply the Fourier transform to each feature and then calculate the spectral residual.

import cython error

When I execute "from _anomaly_kernel_cython import median_filter", I get:
ImportError: Building module _anomaly_kernel_cython failed: ['distutils.errors.DistutilsPlatformError: Unable to find vcvarsall.bat\n']

I have installed Microsoft LightSwitch for Visual Studio 2015 14.0.25431.01 Update 3
System: windows 10
python: 3.6

How can I solve it?
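A common fix for this class of error (an assumption based on the error message; LightSwitch does not ship the C++ compiler that vcvarsall.bat belongs to) is to install the Visual C++ Build Tools and then rebuild the extension in place, as suggested earlier in this page:

```shell
# Install "Build Tools for Visual Studio" with the C++ workload first
# (vcvarsall.bat is part of the C++ build tools, not LightSwitch), then:
python setup.py build_ext --inplace
```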

anomalies?

[images]

The marked score in the picture is a bit high. What is the reason for the abnormal points (714, 715)?

n = 730
A = 50
center = 100
phi = 30
T = 2 * np.pi / 100
t = np.arange(n)
sin = A * np.sin(T * t - phi * T) + center
sin[235:255] = 80  # input time series data
