drain-java's Introduction

drain java

Introduction

drain-java is a continuous log template miner: for each log message it extracts tokens and groups them into clusters of tokens. As new log messages are added, drain-java will either identify similar tokens and update the matching cluster with a new template, or simply create a new token cluster. Each time a cluster is matched, a counter is incremented.

These clusters are stored in a prefix tree, which is somewhat similar to a trie, except that the tree has a fixed depth in order to avoid long tree traversals. Avoiding deep trees also helps to keep the tree balanced. The sketch below illustrates the idea.
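The following is an illustrative sketch of the fixed-depth idea, not drain-java's actual internals: messages are routed by token count and then by their first few tokens, so a lookup never descends more than a fixed number of levels.

Fixed-depth routing sketch
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a fixed-depth prefix tree flattened into a map whose
// key is the token count plus the first few tokens, capping traversal depth.
class FixedDepthIndex {
    static final int DEPTH = 4;
    private final Map<List<String>, List<String>> leaves = new HashMap<>();

    List<String> candidateTemplates(String message) {
        List<String> tokens = List.of(message.split("\\s+"));
        List<String> key = new ArrayList<>();
        key.add(Integer.toString(tokens.size())); // level 1: token count
        for (int i = 0; i < DEPTH - 1 && i < tokens.size(); i++) {
            key.add(tokens.get(i));               // next levels: leading tokens
        }
        return leaves.computeIfAbsent(key, k -> new ArrayList<>());
    }
}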

Usage

First, Java 11 is required to run drain-java.

As a dependency

You can consume drain-java as a dependency in your project via io.github.bric3.drain:drain-java-core. Currently only snapshots are available, by adding this repository:

repositories {
    maven {
        url("https://oss.sonatype.org/content/repositories/snapshots/")
    }
}
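Then declare the dependency; the version below is an assumption, check the snapshot repository for the latest published snapshot.

dependencies {
    // version is an assumption; pick the latest published snapshot
    implementation("io.github.bric3.drain:drain-java-core:0.1.0-SNAPSHOT")
}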

From command line

Since this tool is not yet released, it needs to be built locally. Also, the built jar is not yet very user-friendly, and since it's not a finished product, anything could change.

Example usage
$ ./gradlew build
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar -h

tail - drain
Usage: tail [-dfhV] [--verbose] [-n=NUM]
            [--parse-after-str=FIXED_STRING_SEPARATOR]
            [--parser-after-col=COLUMN] FILE
...
      FILE          log file
  -d, --drain       use DRAIN to extract log patterns
  -f, --follow      output appended data as the file grows
  -h, --help        Show this help message and exit.
  -n, --lines=NUM   output the last NUM lines, instead of the last 10; or use
                      -n 0 to output starting from beginning
      --parse-after-str=FIXED_STRING_SEPARATOR
                    when using DRAIN remove the left part of a log line up to
                      after the FIXED_STRING_SEPARATOR
      --parser-after-col=COLUMN
                    when using DRAIN remove the left part of a log line up to
                      COLUMN
  -V, --version     Print version information and exit.
      --verbose     Verbose output, mostly for DRAIN or errors
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --version
Versioned Command 1.0
Picocli 4.6.3
JVM: 19 (Amazon.com Inc. OpenJDK 64-Bit Server VM 19+36-FR)
OS: Mac OS X 12.6 x86_64

By default, the tool acts similarly to tail, and it will output the file to stdout. The tool can follow a file if the --follow option is passed. However, when run with --drain, the tool classifies log lines using DRAIN and outputs the identified clusters. Note that this tool doesn't handle multiline log messages (like logs that contain a stacktrace).

On the SSH log data set we can use it this way.

$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar \
  -d \ (1)
  -n 0 \ (2)
  --parse-after-str "]: " \ (3)
  build/resources/test/SSH.log (4)
  1. Identify patterns in the log

  2. Start from the beginning of the file (otherwise it starts from the last 10 lines)

  3. Remove the left part of each log line up to after the separator (e.g. `Dec 10 06:55:46 LabSZ sshd[24200]: `), effectively ignoring variable elements like the timestamp.

  4. The log file

log pattern clusters and their occurrences
---- Done processing file. Total of 655147 lines, done in 1.588 s, 51 clusters (1)
0010 (size 140768): Failed password for <*> from <*> port <*> ssh2 (2)
0009 (size 140701): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0007 (size 68958): Connection closed by <*> [preauth]
0008 (size 46642): Received disconnect from <*> 11: <*> <*> <*>
0014 (size 37963): PAM service(sshd) ignoring max retries; <*> > 3
0012 (size 37298): Disconnecting: Too many authentication failures for <*> [preauth]
0013 (size 37029): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0011 (size 36967): message repeated <*> times: [ Failed password for <*> from <*> port <*> ssh2]
0006 (size 20241): Failed <*> for invalid user <*> from <*> port <*> ssh2
0004 (size 19852): pam unix(sshd:auth): check pass; user unknown
0001 (size 18909): reverse mapping checking getaddrinfo for <*> <*> failed - POSSIBLE BREAK-IN ATTEMPT!
0002 (size 14551): Invalid user <*> from <*>
0003 (size 14551): input userauth request: invalid user <*> [preauth]
0005 (size 14356): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*>
0018 (size 1289): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*>
0024 (size 952): fatal: Read from socket failed: Connection reset by peer [preauth]
...
  1. 51 types of logs were identified from 655147 lines in 1.588s

  2. There were 140768 similar log messages matching this pattern, with 3 positions where the token is identified as a parameter <*>.

On the same dataset, this Java implementation performed roughly 10 times faster than Drain3. As my implementation does not yet have masking, the mask configuration was removed from the Drain3 implementation for this comparison.

From Java

This tool is not yet intended to be used as a library, but for the curious the DRAIN algorithm can be used this way:

Minimal DRAIN example
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

var drain = Drain.drainBuilder()
                 .additionalDelimiters("_")
                 .depth(4)
                 .build();
Files.lines(Paths.get("build/resources/test/SSH.log"),
            StandardCharsets.UTF_8)
     .forEach(drain::parseLogMessage);

// do something with clusters
drain.clusters();

Status

Pieces of the puzzle are coming in no particular order. I first bootstrapped the code from a simple Java file, then wrote a Java implementation of Drain. Here's what I would like to do next.

Todo
  • ❏ More unit tests

  • ✓ Wire things together

  • ❏ More documentation

  • ✓ Implement tail follow mode (currently in drain mode the whole file is read and stops once finished)

  • ❏ In follow drain mode dump clusters on forced exit (e.g. when hitting ctrl+c)

  • ✓ Start reading from the last x lines (like tail -n 30)

  • ❏ Implement log masking (e.g. a log contains an email or an IP address, which may be considered private data)

For later
  • ❏ Json message field extraction

  • ❏ How to handle prefixes: dates, log level, etc.; possibly using masking

  • ❏ Investigate markers with specific behavior, e.g. log level severity

  • ❏ Investigate log with stacktraces (likely multiline)

  • ❏ Improve handling of very long lines

  • ❏ Logback appender with micrometer counter

Motivation

I was inspired by a blog article from one of my colleagues on LogMine (many thanks to him for doing the initial research and explaining the concepts). We were both impressed by the log pattern extraction of Datadog's Log Explorer, and his blog post sparked my interest.

After some discussion together, we saw that Drain was a bit superior to LogMine. Googling for Drain in Java didn't yield any results (although I certainly didn't search exhaustively), which triggered the idea to implement this algorithm in Java.

References

This project is mostly a port of Drain3, done by IBM folks (David Ohana, Moshik Hershcovitch). IBM's Drain3 is itself a fork of the original work done by the LogPai team, based on the paper by Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu.

I didn’t follow up on other contributors of these projects, reach out if you think you have been omitted.

For reference here’s the linked I looked at:


drain-java's Issues

Question

Hello, did you port https://github.com/IBM/Drain3 or the original logpai implementation?
In the readme you mention the IBM folks, but I see code mismatches (maybe because the IBM project is actively updated).

Process Json log events

Currently the code understands a line as a log event message. However, in some production systems the application can use structured logging with a Json document. The document may contain additional metadata or context data, but what drain is interested in is the message; for this reason it has to be able to extract the message from the document given a path.

Usually log events are serialized as single-line Json documents, see logstash-logback-encoder for example. So when parsing Json, the events will be assumed to be single-line. However the message itself may be multiline (new lines are likely encoded as \n).

So the only work to do is to pre-process the line as Json and extract the message field, e.g. as sketched below.
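A minimal sketch of this pre-processing with Jackson; the top-level "message" field name and the fallback behavior are assumptions:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;

ObjectMapper mapper = new ObjectMapper();

// Sketch: extract the "message" field (name is an assumption) from a
// single-line Json log event before handing it to the miner.
String extractMessage(String jsonLine) throws IOException {
    JsonNode message = mapper.readTree(jsonLine).path("message");
    // fall back to the raw line when the field is absent
    return message.isMissingNode() ? jsonLine : message.asText();
}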

Enable cluster lookup given a log message

#5 introduced an interesting feature: looking up the cluster of a given log message.

The findLogMessage method only looks for an existing log cluster, without updating it. This might be interesting for implementing a search feature; a hedged usage sketch follows.
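A sketch of how such a lookup might be used; the exact signature and return type are assumptions based on the issue description:

// Assumed API: findLogMessage returns the matching cluster, if any,
// without mutating the tree (per the issue description).
var cluster = drain.findLogMessage("Connection closed by 10.0.0.1 [preauth]");
if (cluster != null) {
    System.out.println("Matched: " + cluster);
}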

Thanks to @TodorKrIv for the idea and initial implementation.

Allow passing custom masks

Currently the code uses really simple tricks to mask some elements of a log event, e.g. by stripping the date component.

However, there are other dynamic log components that may be worth masking: IPs, UUIDs, etc. A pre-masking sketch follows.
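An illustrative pre-masking sketch (not drain-java API): dynamic tokens are replaced with placeholders before the lines reach the miner.

import java.util.regex.Pattern;

// Sketch only: mask IPs and UUIDs with placeholders before mining.
final class Masker {
    private static final Pattern IP = Pattern.compile(
            "\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b");
    private static final Pattern UUID = Pattern.compile(
            "\\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\\b");

    static String mask(String line) {
        line = IP.matcher(line).replaceAll("<IP>");
        return UUID.matcher(line).replaceAll("<UUID>");
    }
}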

Print log clusters periodically or on a signal or key combination

Currently in drain mode, the discovered log clusters are only dumped (printed) once the log file has been entirely read.

This is not suitable when one needs to watch a log file; there should be some mechanism to print the clusters (a minimal sketch follows the list):

  • Periodically, with a flag on the command line to tweak the interval
  • Allow handling a signal (maybe allowing to pick a signal from the OS-supported signals, kill -l, without overriding the standard ones already handled by the JVM)
  • In the running tty, send a key combination, like ctrl+d
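A minimal sketch of the forced-exit variant using a JVM shutdown hook; covering arbitrary signals or key combinations would need platform-specific APIs:

// Sketch: dump the clusters when the process is terminated (e.g. ctrl+c).
// A shutdown hook covers normal termination and SIGINT/SIGTERM.
Runtime.getRuntime().addShutdownHook(new Thread(
        () -> drain.clusters().forEach(System.out::println)));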

Replace the JVM file watcher with alternatives that are not affected by filesystem boundaries

The watch service of the JVM suffers from a few drawbacks regarding its integration with the OS. On Linux in particular, events for bind mounts are not received.

Let's investigate alternatives, in particular the gradle native integration: https://github.com/gradle/native-platform

+    implementation("net.rubygrapefruit:file-events:0.22")
+    implementation("net.rubygrapefruit:native-platform:0.22")

File watching capabilities just appeared in a 0.22 milestone; unfortunately this is not completely released (platform-specific native libraries are not published on bintray for the published milestone).

To follow https://github.com/gradle/native-platform/releases

However native-platform:0.21 is available, and it's possible to play with some APIs like the terminal or files, e.g.:

try {
    Terminals terminals = Native.get(Terminals.class);
    var isTerminal = terminals.withAnsiOutput().isTerminal(Output.Stdout);

    if (isTerminal) {
        var terminal = terminals.withAnsiOutput().getTerminal(Output.Stdout);
        terminal.write("Hello");
        SECONDS.sleep(5);
        terminal.cursorStartOfLine()
                .clearToEndOfLine()
                .bold().write("Bold hello")
                .reset();
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}

Handle multiline log event (stack traces)

Currently the code is only able to process single line log messages. However it's possible to have multiline log messages.

Scope

In particular this ticket is about handling stacktraces, which usually start with whitespace. I am not familiar with stacktraces in other languages, so the goal of this ticket is to focus on Java stack traces that may appear in a log trail. A folding sketch follows the scope notes.

Out of scope

  • Multiline log messages not starting with whitespace
  • Stacktraces from languages other than Java
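A sketch of the intended grouping, under the assumption that continuation lines start with whitespace or "Caused by:":

import java.util.ArrayList;
import java.util.List;

// Sketch: fold stack-trace continuation lines into the preceding event
// so the miner sees one multiline message instead of many fragments.
static List<String> foldEvents(List<String> lines) {
    List<String> events = new ArrayList<>();
    for (String line : lines) {
        boolean continuation = !events.isEmpty()
                && (line.startsWith(" ") || line.startsWith("\t")
                    || line.startsWith("Caused by:"));
        if (continuation) {
            int last = events.size() - 1;
            events.set(last, events.get(last) + "\n" + line);
        } else {
            events.add(line);
        }
    }
    return events;
}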

Resolve `MappedFileLineReaderTest` on Windows and macOS builds

On Windows the build fails with

  • MappedFileLineReaderTest.find_start_position_given_last_lines()

    org.opentest4j.AssertionFailedError:
    expected: 42L
    but was : 43L

  • MappedFileLineReaderTest.can_read_from_position()

    org.opentest4j.AssertionFailedError:
    expected: 183L
    but was : 186L

  • MappedFileLineReaderTest.should_watch_with_channel_sink(Path)

java.io.IOException: Failed to delete temp directory C:\Users\RUNNER~1\AppData\Local\Temp\junit8439001560896197356. The following paths could not be deleted (see suppressed exceptions for details): , test4653189040998961269log

On macOS the build fails with

  • MappedFileLineReaderTest.should_watch_with_channel_sink(Path)

org.opentest4j.AssertionFailedError:
expected: 592L
but was : 38L

Document Drain-Java

Currently I only offered hints in the README, but I definitely need to spend some time on the documentation of:

  • Drain algorithm
  • Drain java API
  • Drain usage

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

This repository currently has no open or pending branches.

Detected dependencies

github-actions
.github/workflows/gradle.yml
  • actions/checkout v4
  • gradle/wrapper-validation-action v2
  • actions/checkout v4
  • actions/setup-java v4
  • gradle/actions v3
  • actions/upload-artifact v4
  • actions/download-artifact v4
  • mikepenz/action-junit-report v4
  • actions/checkout v4
  • actions/setup-java v4
  • gradle/actions v3
gradle
settings.gradle.kts
build.gradle.kts
drain-java-bom/build.gradle.kts
drain-java-core/build.gradle.kts
drain-java-jackson/build.gradle.kts
gradle/libs.versions.toml
  • com.google.code.findbugs:jsr305 3.0.2
  • info.picocli:picocli 4.7.5
  • info.picocli:picocli-codegen 4.7.5
  • com.fasterxml.jackson.core:jackson-core 2.17.0
  • com.fasterxml.jackson.core:jackson-annotations 2.17.0
  • com.fasterxml.jackson.core:jackson-databind 2.17.0
  • org.assertj:assertj-core 3.25.3
  • org.junit.jupiter:junit-jupiter-api 5.10.2
  • org.junit.jupiter:junit-jupiter-engine 5.10.2
  • de.undercouch.download 5.6.0
  • com.github.johnrengelman.shadow 8.1.1
  • com.github.ben-manes.versions 0.51.0
  • com.github.hierynomus.license 0.16.1
  • com.github.vlsi.gradle-extensions 1.90
  • nebula.release 19.0.6
tailer/build.gradle.kts
gradle-wrapper
gradle/wrapper/gradle-wrapper.properties
  • gradle 8.7


Add a mechanism to retain log event metadata such as the severity

Drain is a log mining algorithm; its idea is to find patterns and group similar log event messages.

The good practice is to make the miner process only the message part, i.e. strip elements like the date, the severity, the thread, and the logger name.

Yet it might be interesting to keep some of this information. For example, the severity or the logger name are unlikely to have a high cardinality, and may be good candidates as log cluster metadata. Consider these sample lines (a parsing sketch follows):

2021-03-29 12:55:24.172 [] DEBUG --- [  restartedMain] o.s.b.w.s.ServletContextInitializerBeans : Mapping filters: filterRegistrationBean urls=[/*] order=-2147483647, requestContextFilter urls=[/*] order=-1, contextServletRequestFilter urls=[/*] order=-2147483648, characterEncodingFilter urls=[/*] order=-2147483648, edgeRequestContextFilter urls=[/*] order=-2147483646, hideEdgeTechnicalEndpointsFilter urls=[/*] order=-2147483646, enableDebugLogsFilter urls=[/*] order=-2147483645, newrelicTransactionsFilter urls=[/*] order=-2147483645, accountingFilter urls=[/*] order=-2147483644, formContentFilter urls=[/*] order=-9900, disabledForwardedHeaderFilter urls=[/*] order=2147483647
2021-03-29 12:55:24.173 [] DEBUG --- [  restartedMain] o.s.b.w.s.ServletContextInitializerBeans : Mapping servlets: metricsService urls=[/metrics], dispatcherServlet urls=[/rest/*, /doc/*, /actuator/*, /error/*, /favicon.ico], com.blablacar.common.java.web.JerseyConfig urls=[/*]
2021-03-29 12:55:24.554 [] INFO  --- [  restartedMain] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 3 endpoint(s) beneath base path '/actuator'
2021-03-29 12:55:24.606 [] INFO  --- [  restartedMain] o.s.s.concurrent.ThreadPoolTaskExecutor  : Initializing ExecutorService 'applicationTaskExecutor'
2021-03-29 12:55:24.972 [] INFO  --- [  restartedMain] o.s.b.d.a.OptionalLiveReloadServer       : LiveReload server is running on port 35729
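A sketch of the extraction step, assuming the layout of the sample lines above; the regex and the Event record are illustrative, not drain-java API:

import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: split a line like the samples above into (severity, message),
// so the miner only sees the message while severity is kept as metadata.
record Event(String severity, String message) {}

static final Pattern LINE = Pattern.compile(
        "^\\S+ \\S+ \\[[^]]*\\] +(\\w+) +--- \\[[^]]*\\] \\S+ *: (.*)$");

static Optional<Event> parse(String line) {
    Matcher m = LINE.matcher(line);
    return m.matches() ? Optional.of(new Event(m.group(1), m.group(2)))
                       : Optional.empty();
}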
