logintelligence / logadempirical

Log-based Anomaly Detection with Deep Learning: How Far Are We? (ICSE 2022, Technical Track)

License: MIT License

Language: Python (100.00%)
Topics: deep-learning, log-analysis, log-based-anomaly-detection

logadempirical's Introduction

Log-based Anomaly Detection with Deep Learning: How Far Are We?

This repository is under extension. Please refer to the dev branch.

Abstract: Software-intensive systems produce logs for troubleshooting purposes. Recently, many deep learning models have been proposed to automatically detect system anomalies based on log data. These models typically claim very high detection accuracy. For example, most models report an F-measure greater than 0.9 on the commonly-used HDFS dataset. To achieve a profound understanding of how far we are from solving the problem of log-based anomaly detection, in this paper, we conduct an in-depth analysis of five state-of-the-art deep learning-based models for detecting system anomalies on four public log datasets. Our experiments focus on several aspects of model evaluation, including training data selection, data grouping, class distribution, data noise, and early detection ability. Our results point out that all these aspects have significant impact on the evaluation, and that all the studied models do not always work well. The problem of log-based anomaly detection has not been solved yet. Based on our findings, we also suggest possible future work. This repository provides the implementation of recent log-based anomaly detection methods.

Studied Models

Model Paper
DeepLog DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
LogAnomaly LogAnomaly: Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured Logs
PLELog Semi-Supervised Log-Based Anomaly Detection via Probabilistic Label Estimation
LogRobust Robust log-based anomaly detection on unstable log data
CNN Detecting Anomaly in Big Data System Logs Using Convolutional Neural Network

Requirements

  • Python 3
  • NVIDIA GPU + CUDA cuDNN
  • PyTorch 1.7.0

The required packages are listed in requirements.txt. Install:

pip install -r requirements.txt

Demo

  • Example of DeepLog on BGL with a fixed window size of 1 hour:

python main_run.py --folder=bgl/ --log_file=BGL.log --dataset_name=bgl --model_name=deeplog --window_type=sliding \
  --sample=sliding_window --is_logkey --train_size=0.8 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 \
  --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=150 --history_size=10 --lr=0.001 \
  --accumulation_step=5 --session_level=hour --window_size=60 --step_size=60 --output_dir=experimental_results/demo/random/ --is_process
  • For more explanation of parameters:
python main_run.py --help

Citation

If you find the code and models useful for your research, please cite the following paper:

@inproceedings{le2022log,
  title={Log-based Anomaly Detection with Deep Learning: How Far Are We?},
  author={Le, Van-Hoang and Zhang, Hongyu},
  booktitle={2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)},
  year={2022}
}

logadempirical's People

Contributors

vanhoanglepsa


logadempirical's Issues

The code did not run successfully

This code seems to be missing something. When I follow the demo script, it either runs for a very long time or fails to run (screenshots omitted).
Could someone help fix this? Thank you very much!
You can reach me on WeChat (wlf0104), QQ (1780130585), or by email: [email protected]

LogAnomaly model design fault

In logadempirical/logdeep/models/lstm.py line 109, the inputs are set as:

input0, input1 = features[2], features[1]

where features[2] is the semantic pattern according to logadempirical/logdeep/dataset/sample.py. However, in the original code of the logdeep repository, we have:

input0, input1 = features[0], features[1]

where features[0] is sequential_pattern.

Hence, the current LogAnomaly model in this repo does not work properly.
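For reference, a minimal illustrative sketch (not the repository's code; the tuple layout, vocabulary size, and 300-dimensional placeholder vectors are assumptions based on the description above) of what the three feature views of a log-key window look like, and which of them LogAnomaly is expected to consume:

# Illustrative sketch only, assuming a vocabulary of 4 log keys and a window [2, 3, 2, 1].
from collections import Counter

window = [2, 3, 2, 1]
num_keys = 4

# Sequential pattern: the ordered sequence of log-key ids (DeepLog/LogAnomaly input 0).
sequential_pattern = list(window)                                      # [2, 3, 2, 1]

# Quantitative pattern: per-key counts over the window (LogAnomaly input 1).
counts = Counter(window)
quantitative_pattern = [counts.get(k, 0) for k in range(num_keys)]     # [0, 1, 2, 1]

# Semantic pattern: one embedding vector per event (used by models such as LogRobust),
# stubbed here with placeholder zero vectors.
semantic_pattern = [[0.0] * 300 for _ in window]

# LogAnomaly is expected to consume (sequential, quantitative), i.e. features[0] and
# features[1], not the semantic pattern at features[2] as in the reported bug.
features = (sequential_pattern, quantitative_pattern, semantic_pattern)
input0, input1 = features[0], features[1]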

ValueError while running LogAnomaly

Hi, I was trying to run the LogAnomaly model; on the command line I simply changed the model option to 'loganomaly'. It shows this error:

File "/home/LogADEmpirical/logadempirical/logdeep/dataset/vocab.py", line 46, in find_similar
    if sim > 0.90:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Could you please help with this?
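The repository's find_similar is not shown here, but this error usually means that sim is a NumPy array (for example, a similarity computed against a whole matrix of template vectors) being used directly in an if statement. A minimal sketch of the usual fix, reducing the comparison to a scalar; the function name and threshold are illustrative:

# Minimal sketch (not the repository's code): compare a scalar, not the whole array.
import numpy as np

def find_most_similar(query_vec, template_vecs, threshold=0.90):
    """Return the index of the closest template, or None if below threshold."""
    template_vecs = np.asarray(template_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Cosine similarity of the query against every template (vectorised).
    sims = template_vecs @ query_vec / (
        np.linalg.norm(template_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12
    )
    best = int(np.argmax(sims))
    # Testing a single scalar avoids the "truth value of an array is ambiguous" ValueError.
    if sims[best] > threshold:
        return best
    return None

# Example: two templates, the second matches the query above the threshold.
print(find_most_similar([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1]]))  # -> 1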

Update requirements

Could you please update the requirements for the project? I suppose they are outdated.

Some questions about loading the dataset, especially HDFS

I have some questions about loading the dataset, especially HDFS.

RQ1: What is the parameter "history_size" used for? What's the value of "history_size"?

The explanation of "history_size" in the code is to split sequences for deeplog, loganomaly & logrobust.
I find that the "history_size" is used in the sliding_window() function just like the picture. In this function, it uses fix_window to split sequence,which the fix window size is "history_size".

image

My question is why do you use "history_size" to fix the data sequence, including session window HDFS ?
As a result, the length of the final training dataset sequence is "history_size".

image
image

Is there any mistake in my understanding? Can you help me answer it?
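For reference, a minimal sketch (not the repository's sliding_window(); the names and values are illustrative) of how a fixed history_size typically turns one session of log keys into (history, next-key) training pairs, which is why every training sequence ends up with length history_size:

# Illustrative sketch: split one session into fixed-length histories plus the next key,
# as DeepLog-style models expect. `history_size` plays the role described above.
def split_session(session_keys, history_size):
    pairs = []
    for i in range(len(session_keys) - history_size):
        history = session_keys[i:i + history_size]   # length == history_size
        next_key = session_keys[i + history_size]     # prediction target
        pairs.append((history, next_key))
    return pairs

# Example: a 7-key session with history_size=3 yields 4 pairs, each history of length 3.
print(split_session([5, 22, 5, 11, 9, 11, 9], history_size=3))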

RQ2: Is there any other way to load the datasets? Maybe a new approach could resolve RQ1. (screenshot omitted)

Missing HDFS.log_structured.csv file

Hello, I was trying to run DeepLog on the HDFS dataset but ended up with the following error (screenshot omitted).

Command I used:
!python main_run.py --folder=bgl/ --log_file=HDFS.log --dataset_name=hdfs --model_name=deeplog --window_type=sliding --sample=sliding_window --is_logkey --train_size=0.8 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=150 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=60 --step_size=60 --output_dir=experimental_results/demo/random/ --is_process

Are we supposed to run other scripts first to generate such files (for example data_loader.py or synthesize.py)?
Can we re-run the code with other publicly available formats of the HDFS dataset?
Thanks,

What part of the Thunderbird dataset did you use?

I am trying to reproduce the results of your paper.
In your paper, you wrote "We leverage 10 million continuous log lines..." for the Thunderbird dataset. Could you tell me which part you actually used?
(I tried using the first 10 million lines but it seemed different)

HDFS log dataset

Do you use the same HDFS log dataset as in the DeepLog paper? Could you please provide the log dataset? Or is there anywhere I can view the logs?

The result is not reproducible

Hi, I was trying out DeepLog using the HDFS1 dataset (I used only the first 1M lines, parsed by Drain).

I ran it with the following parameter settings:
python main_run.py --folder=hdfs_1m/ --log_file=HDFS_1m.log --dataset_name=hdfs --device=cpu --model_name=deeplog --window_type=session --sample=sliding_window --is_logkey --train_size=0.4 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=70 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=50 --step_size=50 --output_dir=experimental_results/deeplog/session/cd2 --is_process

This is the result:
Precision: 86.964%, Recall: 53.931%, F1-measure: 66.576%, Specificity: 0.996
(I have tried different parameter settings; there is not much of a difference.)

Could you please give me some advice? Thanks in advance!

Why train = train[:100000] in PLELog?

In logadempirical/PLELog/data/DataLoader.py line 324, there is this line:

train = train[:100000]

which restricts the training dataset to the first 100,000 logs. Why has this been applied?

In fact, it makes the training size different from what is reported in the paper.

Similarly on line 330:
val = val[:20000]

How can I find this BGL.log_structured.csv?

When I ran this program following your README.md, this error occurred.

FileNotFoundError: [Errno 2] No such file or directory: './dataset/bgl/BGL.log_structured.csv'

And I do not know how to convert BGL.log into BGL.log_structured.csv.
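One possible way to generate it, assuming the Drain parser from the logpai/logparser project is used (the exact import path, the BGL header format, and the Drain hyperparameters below are assumptions and may need adjusting for your logparser version), is a sketch like:

# Minimal sketch: parse BGL.log with Drain to produce BGL.log_structured.csv.
from logparser import Drain  # import path varies across logparser versions

input_dir = './dataset/bgl/'    # directory containing the raw BGL.log
output_dir = './dataset/bgl/'   # *_structured.csv / *_templates.csv are written here
log_file = 'BGL.log'

# Header format commonly used for BGL in the logparser benchmark (treat as an assumption).
log_format = '<Label> <Timestamp> <Date> <Node> <Time> <NodeRepeat> <Type> <Component> <Level> <Content>'

parser = Drain.LogParser(
    log_format,
    indir=input_dir,
    outdir=output_dir,
    depth=4,   # parse-tree depth (Drain hyperparameter)
    st=0.5,    # similarity threshold (Drain hyperparameter)
    rex=[],    # optional regexes for masking variable fields
)
parser.parse(log_file)  # writes BGL.log_structured.csv and BGL.log_templates.csv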

License?

This is great work! I just noticed that there is no license information in the repository, which makes it impossible for others to officially reuse it. Can you please add license information? Thank you!

Benchmark is missing

Could you please provide a benchmark (details of argument values for each experiment) for accurate reproduction of results?

Perhaps something similar to the Loglizer repository, which has a benchmarks folder.

Confusion about log parsing

I am confused about how to set the parameters for parsing. I tried to parse HDFS with the settings shown in the screenshot (omitted), but there was obviously a problem. What is the right way?
There are four data sources; how should we set the parsing parameters for each?

DeepLog input is empty

In logadempirical/logdeep/dataset/sample.py line 161, the value of sequential_pattern is set to an empty list:

sequential_pattern = []

which is the main input of the DeepLog model. It caused an error while running DeepLog, saying that the input is empty.

However, according to the logdeep repo, this feature is supposed to be initialized as:
sequential_pattern = line[i:i + window_size]

Please update the code.

Missing data['Seq']: what is 'Seq'?

Hello, I have some questions about the 'Seq' field in the data (screenshot omitted).
I tried to find where 'Seq' is generated in the code. However, apart from the part generated for the session window, the training data saved by the sliding window does not contain this field. I think it is a list of log indices; does that make sense?
Is there any mistake in my understanding? Could you please help with this?

embeddings.json

I'm trying to reproduce your results (like another poster here)...

Perhaps a silly question, but after downloading the HDFS and BGL datasets and running them through Drain, I'm now getting this error. Can you advise how/where to get your "embeddings.json" file?

python3 main_run.py --folder=hdfs/ --log_file=HDFS.log --dataset_name=hdfs --device=cpu --model_name=deeplog --window_type=session --sample=sliding_window --is_logkey --train_size=0.4 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=70 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=50 --step_size=50 --output_dir=experimental_results/deeplog/session/cd2 --is_process
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Loading ./dataset/hdfs/HDFS.log_structured.csv
575061it [00:00, 1983685.17it/s]
11175629it [00:19, 566251.66it/s]
Save options parameters
vocab size 20
save vocab in experimental_results/deeplog/session/cd2hdfs/deeplog_vocab.pkl
Loading vocab
20
Loading train dataset

Traceback (most recent call last):
  File "main_run.py", line 213, in <module>
    main()
  File "main_run.py", line 195, in main
    run_deeplog(options)
  File "/stephen/LogADEmpirical/logadempirical/deeplog.py", line 26, in run_deeplog
    Trainer(options).start_train()
  File "/stephen/LogADEmpirical/logadempirical/logdeep/tools/train.py", line 101, in __init__
    train_logs, train_labels = sliding_window(data,
  File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 108, in sliding_window
    event2semantic_vec = read_json(os.path.join(data_dir, e_name))
  File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 14, in read_json
    with open(filename, 'r') as load_f:
FileNotFoundError: [Errno 2] No such file or directory: './dataset/hdfs/embeddings.json'
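In case it helps others hitting the same FileNotFoundError: sample.py apparently expects embeddings.json to map each event/template key to a semantic vector (it is loaded as event2semantic_vec). The exact key format and the proper source of the vectors are not documented here, so the following is only a heavily hedged sketch of the expected shape; it writes placeholder random vectors keyed by the EventId column of the structured CSV, which unblocks the file lookup but is no substitute for real template embeddings (e.g., averaged word vectors of the parsed templates).

# Heavily hedged sketch: build a JSON dict of {event key: semantic vector}.
import json
import numpy as np
import pandas as pd

structured = pd.read_csv('./dataset/hdfs/HDFS.log_structured.csv')
event_ids = structured['EventId'].unique()

rng = np.random.default_rng(0)
dim = 300  # assumed embedding dimension
event2semantic_vec = {str(eid): rng.normal(size=dim).tolist() for eid in event_ids}

with open('./dataset/hdfs/embeddings.json', 'w') as f:
    json.dump(event2semantic_vec, f)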
