We describe here how we detect anomalies using RIPE Atlas DNS CHAOS measurements.
- Ground truth: data/k-root-ddos-20151130.csv. This file covers k-root during the DDoS of November 30th, 2015, when we know for sure there were anomalies, as described in the Root DDoS paper.
- Random day: data/k-root-20170401.csv
There are many possible algorithms for time series anomaly detection. Most of them rely on periodic data to operate. In this hackathon, we used three algorithms:
- Twitter's Robust Time Series Anomaly Detection
- ARIMA
- an ad-hoc algorithm derived from our experience
- others?
We have run both ARIMA and Twitter's algorithm. Even though they detect anomalies automatically, we still need to define what counts as an anomaly. For example, a variation of 3 ms in the median RTT is not necessarily an anomaly.
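One way to encode such a rule is to require both a statistical deviation and a minimum absolute change. The sketch below is illustrative only: the function name `is_significant` and the 10 ms floor are assumptions made for this example, not values used by our detectors.

```python
def is_significant(value, baseline, spread, k=3.0, min_delta_ms=10.0):
    """Return True only if `value` deviates from `baseline` by more than
    k * spread AND by more than an absolute floor (hypothetical 10 ms)."""
    delta = abs(value - baseline)
    return delta > k * spread and delta > min_delta_ms


# A 3 ms shift on a very stable series exceeds 3 * spread but not the floor.
print(is_significant(36.0, 33.0, 0.5))  # False
print(is_significant(80.0, 33.0, 0.5))  # True
```

The absolute floor keeps very stable series, where the spread is tiny, from raising alarms on changes that are too small to matter operationally.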
This module takes as input the CSV files generated by the downloader and parser, located at <$ADD PATH HERE>
The data flow is as follows:
- DNS measurement ID from chaos.id DNS measurement
- downloader.sh downloads and parses it
- The anomalyDetector module runs on it and generates another file that specifies where the anomalies are
- This information is then stored and plotted by the visualization module
In this section we document the criteria for anomalies.
We use the definitions from the Root DDoS paper (Fig. 1):
- Letter: IP address of an anycast server (e.g., k-root or ns1.dns.nl)
- Site: a physical location of an anycast service (e.g., kroot-ams)
- Server: a server within a site (e.g., k-root-ams-srv1)
We know that under stable conditions anycast is largely stable (see this paper), meaning that probes should reach the same site over and over.
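This stability assumption can be checked by counting how many probes are answered by more than one site over a period. The following is a minimal sketch, assuming each observation is a hypothetical `(timestamp, probe_id, site)` tuple; this layout and the site name `kroot-lhr` are assumptions for illustration, not the format of our CSV files.

```python
from collections import defaultdict


def probes_that_switched(observations):
    """Return the set of probe IDs that were answered by more than one site.

    `observations` is an iterable of (timestamp, probe_id, site) tuples
    (hypothetical layout).
    """
    sites_per_probe = defaultdict(set)
    for _timestamp, probe_id, site in observations:
        sites_per_probe[probe_id].add(site)
    return {probe for probe, sites in sites_per_probe.items() if len(sites) > 1}


obs = [
    (1448841600, 101, "kroot-ams"),
    (1448842200, 101, "kroot-ams"),
    (1448841600, 202, "kroot-ams"),
    (1448842200, 202, "kroot-lhr"),  # hypothetical site name
]
print(probes_that_switched(obs))  # {202}
```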
Below is a sample of the letter-level input file (we only consider rcode=0 answers):
timestamp | nProbes | nSites | nQueries | nResponses | q25RTT | q50RTT | q75RTT | q90RTT |
---|---|---|---|---|---|---|---|---|
1448841600 | 8890 | 24 | 22759 | 13887 | 15.7070 | 32.9710 | 58.8360 | 135.8580 |
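A file with these columns can be loaded into typed rows as sketched below. This assumes the CSV is comma-separated and starts with a header line naming the columns shown above; `load_rows` is a hypothetical helper, not part of our scripts.

```python
import csv
import io

INT_FIELDS = ("timestamp", "nProbes", "nSites", "nQueries", "nResponses")
FLOAT_FIELDS = ("q25RTT", "q50RTT", "q75RTT", "q90RTT")


def load_rows(handle):
    """Parse a letter-level CSV (assumed comma-separated, with a header)."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {name: int(raw[name]) for name in INT_FIELDS}
        row.update({name: float(raw[name]) for name in FLOAT_FIELDS})
        rows.append(row)
    return rows


# In-memory stand-in for a real file, using the sample row from the table.
sample = io.StringIO(
    "timestamp,nProbes,nSites,nQueries,nResponses,q25RTT,q50RTT,q75RTT,q90RTT\n"
    "1448841600,8890,24,22759,13887,15.7070,32.9710,58.8360,135.8580\n"
)
rows = load_rows(sample)
print(rows[0]["nProbes"], rows[0]["q50RTT"])  # 8890 32.971
```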
As shown in the Root DDoS paper, nProbes goes down during an event. At the letter level, we therefore define the following anomalies:
- F1: Reachability failure: nProbes goes down; RTT values may or may not go up (if the server is mostly down and only a few probes get responses, the RTT might not change much)
- F2: Performance issues: nProbes does not go down, but the RTT goes up
To detect them, we propose:
- F1: nProbes drops at least 3x the standard deviation below its median
- F2: q50RTT (the 50th percentile, i.e., the median RTT) rises at least 3x the standard deviation above its median
To run it:
python letter-level-detector.py $input $output
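The F1 and F2 rules can be sketched as follows. This is not the code of letter-level-detector.py: the function name `letter_level_anomalies`, the use of the whole series to compute the median and standard deviation, and the synthetic data are all assumptions made for illustration.

```python
import statistics


def letter_level_anomalies(rows, k=3.0):
    """Label each row 'F1', 'F2', or None.

    `rows` is a list of dicts with at least `nProbes` and `q50RTT`.
    Median and standard deviation are taken over the whole series,
    which is an assumption of this sketch.
    """
    probes = [r["nProbes"] for r in rows]
    rtts = [r["q50RTT"] for r in rows]
    probes_floor = statistics.median(probes) - k * statistics.pstdev(probes)
    rtt_ceiling = statistics.median(rtts) + k * statistics.pstdev(rtts)

    labels = []
    for r in rows:
        if r["nProbes"] < probes_floor:
            labels.append("F1")  # reachability failure
        elif r["q50RTT"] > rtt_ceiling:
            labels.append("F2")  # performance issue, probes still respond
        else:
            labels.append(None)
    return labels


# Synthetic series (made-up values): one probe drop and one RTT spike.
rows = [{"nProbes": 8880 + 20 * (i % 2), "q50RTT": 32.0 + 2.0 * (i % 2)}
        for i in range(30)]
rows[10]["nProbes"] = 4000
rows[20]["q50RTT"] = 200.0
labels = letter_level_anomalies(rows)
print([(i, label) for i, label in enumerate(labels) if label])
# [(10, 'F1'), (20, 'F2')]
```

Because the standard deviation is computed over a series that includes the anomalies themselves, a long or severe event inflates the threshold; computing the baseline on a known-quiet window is one possible alternative.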
At the site level, we propose:
- F3: the number of probes at a site goes above or below its median by at least 3x the standard deviation: a site can take over the load of others if they go down (see Figure 5 of the Root DDoS paper).
- F2: same as at the letter level, applied per site
To run it:
python site-level-detector.py $input $output
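The F3 rule differs from F1 in that it is two-sided: a site may lose probes, or gain them when it absorbs traffic from other sites. A minimal sketch, assuming a hypothetical function `site_level_f3` that receives one site's probe counts per time bin (this is not the code of site-level-detector.py, and the data is synthetic):

```python
import statistics


def site_level_f3(probe_counts, k=3.0):
    """Return the indexes where a site's probe count departs from its
    median by more than k standard deviations, in either direction."""
    center = statistics.median(probe_counts)
    spread = statistics.pstdev(probe_counts)
    return [i for i, n in enumerate(probe_counts) if abs(n - center) > k * spread]


# Synthetic probe counts for one site (made-up values).
gaining = [400 + 10 * (i % 2) for i in range(30)]
gaining[12] = 1200  # site absorbs load from others
losing = [400 + 10 * (i % 2) for i in range(30)]
losing[12] = 0      # site becomes unreachable

print(site_level_f3(gaining))  # [12]
print(site_level_f3(losing))   # [12]
```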
@Jan Harm