nezha's Introduction

Nezha

This repository is the basic implementation of our FSE'23 paper, Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data.

Description

Nezha is an interpretable and fine-grained RCA approach that pinpoints root causes at the code region and resource type level by jointly analyzing multi-modal data. Nezha transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of Nezha is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way.
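The comparison idea can be illustrated with a toy sketch. This is not Nezha's algorithm (the actual mining and ranking live in pattern_miner.py and pattern_ranker.py); it assumes events are plain strings, patterns are contiguous event pairs, and the score is the absolute change in support between phases. The names pattern_support and compare_patterns, and the event names, are hypothetical.

```python
from collections import Counter


def pattern_support(event_sequences, length=2):
    """Count how many sequences contain each contiguous pattern of `length` events."""
    support = Counter()
    for seq in event_sequences:
        # Use a set so a pattern counts at most once per sequence.
        seen = {tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)}
        support.update(seen)
    return support


def compare_patterns(normal_sequences, faulty_sequences, length=2):
    """Rank patterns by how much their support differs between the two phases."""
    normal = pattern_support(normal_sequences, length)
    faulty = pattern_support(faulty_sequences, length)
    n_normal = max(len(normal_sequences), 1)
    n_faulty = max(len(faulty_sequences), 1)
    scores = {
        pattern: abs(normal[pattern] / n_normal - faulty[pattern] / n_faulty)
        for pattern in set(normal) | set(faulty)
    }
    # Largest change first; ties broken by pattern name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical event names, for illustration only.
normal_phase = [["recv", "charge_start", "charge_ok", "reply"]] * 3
faulty_phase = [["recv", "charge_start", "cpu_high", "reply"]] * 3

ranked = compare_patterns(normal_phase, faulty_phase)
for pattern, score in ranked:
    print(pattern, round(score, 2))
```

Patterns that appear or disappear in the fault-suffering phase rise to the top, while patterns common to both phases score zero.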

Quick Start

Requirements

  • Python 3.6 is recommended for running the anomaly detection. Otherwise, any Python 3 version should work.
  • Git is also needed.

Setup

Download Nezha first via git clone git@github.com:IntelligentDDS/Nezha.git

Enter the Nezha directory with cd Nezha

Run python3.6 -m pip install -r requirements.txt to install the dependencies for Nezha

Running Nezha

OnlineBoutique at service level

python3.6 ./main.py --ns hipster --level service 

pattern_ranker.py:622: -------- hipster Fault numbuer : 56-------
pattern_ranker.py:623: --------AS@1 Result-------
pattern_ranker.py:624: 92.857143 %
pattern_ranker.py:625: --------AS@3 Result-------
pattern_ranker.py:626: 96.428571 %
pattern_ranker.py:627: --------AS@5 Result-------
pattern_ranker.py:628: 96.428571 %

OnlineBoutique at inner service level

python3.6 ./main.py --ns hipster --level inner

pattern_ranker.py:622: -------- hipster Fault numbuer : 56-------
pattern_ranker.py:623: --------AIS@1 Result-------
pattern_ranker.py:624: 92.857143 %
pattern_ranker.py:625: --------AIS@3 Result-------
pattern_ranker.py:626: 96.428571 %
pattern_ranker.py:627: --------AIS@5 Result-------
pattern_ranker.py:628: 96.428571 %

Trainticket at service level

python3.6 ./main.py --ns ts --level service

pattern_ranker.py:622: -------- ts Fault numbuer : 45-------
pattern_ranker.py:623: --------AS@1 Result-------
pattern_ranker.py:624: 86.666667 %
pattern_ranker.py:625: --------AS@3 Result-------
pattern_ranker.py:626: 97.777778 %
pattern_ranker.py:627: --------AS@5 Result-------
pattern_ranker.py:628: 97.777778 %

Trainticket at inner service level

python3.6 ./main.py --ns ts --level inner

pattern_ranker.py:622: -------- ts Fault numbuer : 45-------
pattern_ranker.py:623: --------AIS@1 Result-------
pattern_ranker.py:624: 86.666667 %
pattern_ranker.py:625: --------AIS@3 Result-------
pattern_ranker.py:626: 97.777778 %
pattern_ranker.py:627: --------AIS@5 Result-------
pattern_ranker.py:628: 97.777778 %

The detailed service-level and inner-service-level results are printed and recorded in ./log.
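As a rough guide to reading these numbers, the sketch below assumes AS@k and AIS@k are top-k accuracies: the percentage of fault cases whose true root cause appears among the first k ranked candidates, at service and inner-service level respectively. The function top_k_accuracy and the toy rankings are hypothetical and are not taken from pattern_ranker.py.

```python
def top_k_accuracy(ranked_lists, ground_truth, k):
    """Percentage of cases whose true root cause is within the top-k candidates."""
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, ground_truth) if truth in ranked[:k]
    )
    return 100.0 * hits / len(ground_truth)


# Toy rankings for three fault cases (hypothetical service names).
ranked_lists = [
    ["cartservice", "frontend"],
    ["frontend", "cartservice"],
    ["adservice", "cartservice"],
]
ground_truth = ["cartservice", "cartservice", "frontend"]

print("%f %%" % top_k_accuracy(ranked_lists, ground_truth, 1))
print("%f %%" % top_k_accuracy(ranked_lists, ground_truth, 2))
```

Under this reading, the OnlineBoutique figures are consistent with 52 and 54 hits out of 56 faults (92.857143 % and 96.428571 %), and the Trainticket figures with 39 and 44 hits out of 45 faults (86.666667 % and 97.777778 %).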

Dataset

2022-08-22 and 2022-08-23 are the fault-suffering datasets of OnlineBoutique

2023-01-29 and 2023-01-30 are the fault-suffering datasets of Trainticket

Fault-free data

construct_data is the data of the fault-free phase

root_cause_hipster.json is the inner-service level label of root causes in OnlineBoutique

root_cause_ts.json is the inner-service level label of root causes in Trainticket

As an example,

    "checkoutservice": {
        "return": "Start charge card_Charge successfully",
        "exception": "Start charge card_Charge successfully",
        "network_delay": "NetworkP90(ms)",
        "cpu_contention": "CpuUsageRate(%)",
        "cpu_consumed": "CpuUsageRate(%)"
    },

This label means that, for a return fault injected into checkoutservice, the root cause is the code region between the log statements containing Start charge card and Charge successfully.
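A minimal sketch of reading such a label follows. It assumes that an underscore separates the two boundary log statements of a code region and that a label without an underscore names a resource metric; this convention is inferred from the example above and may not hold for every entry. The helper parse_label is hypothetical.

```python
import json

# Label excerpt copied from the example above.
LABELS_JSON = """
{
    "checkoutservice": {
        "return": "Start charge card_Charge successfully",
        "exception": "Start charge card_Charge successfully",
        "network_delay": "NetworkP90(ms)",
        "cpu_contention": "CpuUsageRate(%)",
        "cpu_consumed": "CpuUsageRate(%)"
    }
}
"""


def parse_label(label):
    """Split a label into a code region or a resource metric (assumed convention)."""
    if "_" in label:
        # Assumption: "<first log statement>_<second log statement>".
        start_log, end_log = label.split("_", 1)
        return {"kind": "code_region", "start_log": start_log, "end_log": end_log}
    return {"kind": "resource", "metric": label}


labels = json.loads(LABELS_JSON)
parsed = {fault: parse_label(value) for fault, value in labels["checkoutservice"].items()}
for fault, info in parsed.items():
    print(fault, info)
```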

Fault-suffering Data

rca_data is the data of the fault-suffering phase

2022-08-22-fault_list and 2022-08-23-fault_list are the service level labels of root causes in OnlineBoutique

2023-01-29-fault_list and 2023-01-30-fault_list are the service level labels of root causes in TrainTicket

Project Structure

.
├── LICENSE
├── README.md
├── construct_data
│   ├── 2022-08-22
│   │   ├── log
│   │   ├── metric
│   │   ├── trace
│   │   └── traceid
│   ├── 2022-08-23
│   ├── 2023-01-29
│   ├── 2023-01-30
│   ├── root_cause_hipster.json: label at inner-service level for OnlineBoutique
│   └── root_cause_ts.json: label at inner-service level for ts
├── rca_data
│   ├── 2022-08-22
│   │   ├── log
│   │   ├── metric
│   │   ├── trace
│   │   ├── traceid
│   │   └── 2022-08-22-fault_list.json: label at service level
│   ├── 2022-08-23
│   ├── 2023-01-29
│   └── 2023-01-30
├── log: RCA result
├── log_template: drain3 config 
├── alarm.py: generate alarm 
├── data_integrate.py: transform metric, log, and trace to event graph 
├── log_parsing.py: parsing logs
├── log.py: record logs
├── pattern_miner.py: mine patterns from event graph
├── pattern_ranker.py: rank suspicious patterns
├── main.py: running nezha
└── requirements.txt
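Before running main.py, it can help to confirm that a dataset day has the layout shown above. The sketch below assumes every date directory contains the same four entries as the expanded 2022-08-22 example (log, metric, trace, traceid); the helper missing_modalities is hypothetical and not part of the repository.

```python
from pathlib import Path

# Entries listed under 2022-08-22 in the project structure.
MODALITIES = ("log", "metric", "trace", "traceid")


def missing_modalities(dataset_root, date):
    """Return the expected entries that are absent from <dataset_root>/<date>."""
    day_dir = Path(dataset_root) / date
    return [name for name in MODALITIES if not (day_dir / name).exists()]


if __name__ == "__main__":
    for root in ("construct_data", "rca_data"):
        for date in ("2022-08-22", "2022-08-23", "2023-01-29", "2023-01-30"):
            missing = missing_modalities(root, date)
            status = "ok" if not missing else "missing: " + ", ".join(missing)
            print(f"{root}/{date}: {status}")
```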

Reference

Please cite our FSE'23 paper if you find this work helpful.

@inproceedings{nezha,
  title={Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data},
  author={Yu, Guangba and Chen, Pengfei and Li, Yufeng and Chen, Hongyang and Li, Xiaoyun and Zheng, Zibin},
  booktitle={ESEC/FSE 2023},
  pages={},
  year={2023},
  organization={ACM}
}

nezha's People

Contributors

mmantyla, yuxiaoba

nezha's Issues

About Nezha: seeking help

Greetings.
First of all, I sincerely appreciate your generosity in providing the source code.
We plan to use Nezha as one of the baselines for our new platform, for which I need to reproduce Nezha on five multi-modal datasets.
In the process, I ran into two problems and have a few small questions about Nezha.
The first problem concerns the test environment. requirements.txt pins Orange3_Associate==1.1.9 together with numpy==1.15.4. However, Orange3_Associate 1.1.9 requires Orange3>=3.25.0 and openTSNE>=0.6.1, both of which require numpy>=1.16.0.
The second problem may be caused by the first: I ran Nezha successfully with Orange3_Associate==1.1.9 and numpy==1.15.4 on python3.6, but the results were very poor. Here is part of the output:
[INFO]2023-11-13 21:04:39,357 pattern_ranker.py:650: --------AS@1 Result-------
[INFO]2023-11-13 21:04:39,357 pattern_ranker.py:651: 5.357143 %
[INFO]2023-11-13 21:04:39,357 pattern_ranker.py:652: --------AS@3 Result-------
[INFO]2023-11-13 21:04:39,357 pattern_ranker.py:653: 7.142857 %
[INFO]2023-11-13 21:04:39,357 pattern_ranker.py:654: --------AS@5 Result-------
[INFO]2023-11-13 21:04:39,357 pattern_ranker.py:655: 7.142857 %
I'd like to know what caused this.
I also noticed that the repository ships your own data, which contains both a trace_id file and a trace file. If my understanding is correct, the code only uses a trace when its ID appears in the trace_id file (see the data_integrate() function in data_integrate.py). I wonder why.
Any light you could shed on this would be much appreciated.

Replication package

Hi,

I was wondering if you have made or can make public the complete replication package of your study. I was particularly interested in your implementation of MicroRCA and TraceAnomaly algorithms for your benchmark.

Thanks in advance!

Reproducibility matter

Hi my idol @yuxiaoba 😄

I've been trying to reproduce the results of Nezha on Hipster Shop. I followed the instructions carefully to set up the environment, but was unable to reproduce the results. I have attached the results I got.

[attached image]

I see that other people got exactly the same numbers as I did (#1).

I'm wondering whether you have met this problem before (and how you solved it). I know you're busy, but please help when you have time. Much appreciated! 😄

MicroRCA and TraceAnomaly implementations

Hi,

I have gone through your paper and the code published here on GitHub, and I was wondering whether you have made public your implementations of MicroRCA and TraceAnomaly used for the comparison with Nezha.

I inspected the code available here, but I could not find them.

Thanks in advance!
