drain-java's Introduction

drain java

Introduction

drain-java is a continuous log template miner: for each log message it extracts tokens and groups them into clusters of tokens. As new log messages are added, drain-java will either identify similar tokens and update the matching cluster with a new template, or simply create a new token cluster. Each time a cluster is matched, a counter is incremented.

These clusters are stored in a prefix tree, which is somewhat similar to a trie, except that the tree has a fixed depth in order to avoid long tree traversals. Avoiding deep trees also helps to keep the tree balanced. The sketch below illustrates the idea.
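The following is an illustrative sketch of the fixed-depth idea, not drain-java's actual internals: messages are routed by token count and then by their first few tokens, so a lookup never descends more than a fixed number of levels.

Fixed-depth routing sketch
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a fixed-depth prefix tree flattened into a map whose
// key is the token count plus the first few tokens, capping traversal depth.
class FixedDepthIndex {
    static final int DEPTH = 4;
    private final Map<List<String>, List<String>> leaves = new HashMap<>();

    List<String> candidateTemplates(String message) {
        List<String> tokens = List.of(message.split("\\s+"));
        List<String> key = new ArrayList<>();
        key.add(Integer.toString(tokens.size())); // level 1: token count
        for (int i = 0; i < DEPTH - 1 && i < tokens.size(); i++) {
            key.add(tokens.get(i));               // next levels: leading tokens
        }
        return leaves.computeIfAbsent(key, k -> new ArrayList<>());
    }
}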

Usage

First, Java 11 is required to run drain-java.

As a dependency

You can consume drain-java as a dependency in your project via io.github.bric3.drain:drain-java-core. Currently only snapshots are available, by adding this repository:

repositories {
    maven {
        url("https://oss.sonatype.org/content/repositories/snapshots/")
    }
}
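Then declare the dependency; the version below is an assumption, check the snapshot repository for the latest published snapshot.

dependencies {
    // version is an assumption; pick the latest published snapshot
    implementation("io.github.bric3.drain:drain-java-core:0.1.0-SNAPSHOT")
}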

From command line

Since this tool is not yet released, it needs to be built locally. Also, the built jar is not yet very user-friendly, and since it's not a finished product, anything could change.

Example usage
$ ./gradlew build
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar -h

tail - drain
Usage: tail [-dfhV] [--verbose] [-n=NUM]
            [--parse-after-str=FIXED_STRING_SEPARATOR]
            [--parser-after-col=COLUMN] FILE
...
      FILE          log file
  -d, --drain       use DRAIN to extract log patterns
  -f, --follow      output appended data as the file grows
  -h, --help        Show this help message and exit.
  -n, --lines=NUM   output the last NUM lines, instead of the last 10; or use
                      -n 0 to output starting from beginning
      --parse-after-str=FIXED_STRING_SEPARATOR
                    when using DRAIN remove the left part of a log line up to
                      after the FIXED_STRING_SEPARATOR
      --parser-after-col=COLUMN
                    when using DRAIN remove the left part of a log line up to
                      COLUMN
  -V, --version     Print version information and exit.
      --verbose     Verbose output, mostly for DRAIN or errors
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --version
Versioned Command 1.0
Picocli 4.6.3
JVM: 19 (Amazon.com Inc. OpenJDK 64-Bit Server VM 19+36-FR)
OS: Mac OS X 12.6 x86_64

By default, the tool acts similarly to tail, and it will output the file to stdout. The tool can follow a file if the --follow option is passed. However, when run with --drain, the tool classifies log lines using DRAIN and outputs the identified clusters. Note that this tool doesn't handle multiline log messages (like logs that contain a stacktrace).

On the SSH log data set we can use it this way.

$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar \
  -d \ (1)
  -n 0 \ (2)
  --parse-after-str "]: " \ (3)
  build/resources/test/SSH.log (4)
  1. Identify patterns in the log

  2. Start from the beginning of the file (otherwise it starts from the last 10 lines)

  3. Remove the left part of each log line up to after the separator (e.g. `Dec 10 06:55:46 LabSZ sshd[24200]: `), effectively ignoring variable elements like the timestamp.

  4. The log file

log pattern clusters and their occurrences
---- Done processing file. Total of 655147 lines, done in 1.588 s, 51 clusters (1)
0010 (size 140768): Failed password for <*> from <*> port <*> ssh2 (2)
0009 (size 140701): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0007 (size 68958): Connection closed by <*> [preauth]
0008 (size 46642): Received disconnect from <*> 11: <*> <*> <*>
0014 (size 37963): PAM service(sshd) ignoring max retries; <*> > 3
0012 (size 37298): Disconnecting: Too many authentication failures for <*> [preauth]
0013 (size 37029): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0011 (size 36967): message repeated <*> times: [ Failed password for <*> from <*> port <*> ssh2]
0006 (size 20241): Failed <*> for invalid user <*> from <*> port <*> ssh2
0004 (size 19852): pam unix(sshd:auth): check pass; user unknown
0001 (size 18909): reverse mapping checking getaddrinfo for <*> <*> failed - POSSIBLE BREAK-IN ATTEMPT!
0002 (size 14551): Invalid user <*> from <*>
0003 (size 14551): input userauth request: invalid user <*> [preauth]
0005 (size 14356): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*>
0018 (size 1289): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*>
0024 (size 952): fatal: Read from socket failed: Connection reset by peer [preauth]
...
  1. 51 types of logs were identified from 655147 lines in 1.588s

  2. There were 140768 similar log messages matching this pattern, with 3 positions where the token is identified as a parameter <*>.

On the same dataset, this Java implementation performed roughly 10 times faster than Drain3. As my implementation does not yet have masking, the mask configuration was removed from the Drain3 implementation for this comparison.

From Java

This tool is not yet intended to be used as a library, but for the curious the DRAIN algorithm can be used this way:

Minimal DRAIN example
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

var drain = Drain.drainBuilder()
                 .additionalDelimiters("_")
                 .depth(4)
                 .build();
Files.lines(Paths.get("build/resources/test/SSH.log"),
            StandardCharsets.UTF_8)
     .forEach(drain::parseLogMessage);

// do something with clusters
drain.clusters();

Status

Pieces of the puzzle are coming in no particular order. I first bootstrapped the code from a simple Java file, then wrote a Java implementation of Drain. Here's what I would like to do next.

Todo
  • ❏ More unit tests

  • ✓ Wire things together

  • ❏ More documentation

  • ✓ Implement tail follow mode (currently in drain mode the whole file is read and stops once finished)

  • ❏ In follow drain mode dump clusters on forced exit (e.g. when hitting ctrl+c)

  • ✓ Start reading from the last x lines (like tail -n 30)

  • ❏ Implement log masking (e.g. a log contains an email or an IP address, which may be considered private data)

For later
  • ❏ Json message field extraction

  • ❏ How to handle prefixes: dates, log level, etc.; possibly using masking

  • ❏ Investigate markers with specific behavior, e.g. log level severity

  • ❏ Investigate log with stacktraces (likely multiline)

  • ❏ Improve handling of very long lines

  • ❏ Logback appender with micrometer counter

Motivation

I was inspired by a blog article from one of my colleagues on LogMine (many thanks to him for doing the initial research and explaining the concepts). We were both impressed by the log pattern extraction of Datadog's Log Explorer, and his blog post sparked my interest.

After some discussion together, we saw that Drain was a bit superior to LogMine. Googling for Drain in Java didn't yield any results (although I certainly didn't search exhaustively), which triggered the idea to implement this algorithm in Java.

References

This project is mostly a port of Drain3, done by IBM folks (David Ohana, Moshik Hershcovitch). IBM's Drain3 is itself a fork of the original work done by the LogPai team, based on the paper by Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu.

I didn’t follow up on other contributors of these projects, reach out if you think you have been omitted.

For reference here’s the linked I looked at:


drain-java's Issues

Question

Hello, did you port https://github.com/IBM/Drain3 or the original logpai implementation?
In the readme you mention the IBM folks, but I see code mismatches (maybe because the IBM project is actively updated).

Process Json log events

Currently the code understands a line as a log event message. However, in some production systems the application can use structured logging with a Json document. The document may contain additional metadata or context data, but what drain is interested in is the message; for this reason it has to be able to extract the message from the document given a path.

Usually log events are serialized as single-line Json documents, see logstash-logback-encoder for example. So when parsing Json, the events will be assumed to be single-line. However the message itself may be multiline (new lines are likely encoded as \n).

So the only work to do is to pre-process the line as Json and extract the message field, e.g. as sketched below.
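A minimal sketch of this pre-processing with Jackson; the top-level "message" field name and the fallback behavior are assumptions:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;

ObjectMapper mapper = new ObjectMapper();

// Sketch: extract the "message" field (name is an assumption) from a
// single-line Json log event before handing it to the miner.
String extractMessage(String jsonLine) throws IOException {
    JsonNode message = mapper.readTree(jsonLine).path("message");
    // fall back to the raw line when the field is absent
    return message.isMissingNode() ? jsonLine : message.asText();
}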

Enable cluster lookup given a log message

#5 introduced an interesting feature: looking up the cluster of a given log message.

The findLogMessage method only looks for an existing log cluster, without updating it. This might be interesting for implementing a search feature; a hedged usage sketch follows.
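A sketch of how such a lookup might be used; the exact signature and return type are assumptions based on the issue description:

// Assumed API: findLogMessage returns the matching cluster, if any,
// without mutating the tree (per the issue description).
var cluster = drain.findLogMessage("Connection closed by 10.0.0.1 [preauth]");
if (cluster != null) {
    System.out.println("Matched: " + cluster);
}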

Thanks to @TodorKrIv for the idea and initial implementation.

Allow passing custom masks

Currently the code uses really simple tricks to mask some elements of a log event, e.g. by stripping the date component.

However, there are other dynamic log components that may be worth masking: IPs, UUIDs, etc. A pre-masking sketch follows.
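An illustrative pre-masking sketch (not drain-java API): dynamic tokens are replaced with placeholders before the lines reach the miner.

import java.util.regex.Pattern;

// Sketch only: mask IPs and UUIDs with placeholders before mining.
final class Masker {
    private static final Pattern IP = Pattern.compile(
            "\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b");
    private static final Pattern UUID = Pattern.compile(
            "\\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\\b");

    static String mask(String line) {
        line = IP.matcher(line).replaceAll("<IP>");
        return UUID.matcher(line).replaceAll("<UUID>");
    }
}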

Print log clusters periodically or on a signal or key combination

Currently in drain mode, the discovered log clusters are only dumped (printed) once the log file has been entirely read.

This is not suitable when one needs to watch a log file; there should be some mechanism to print the clusters (a minimal sketch follows the list):

  • Periodically, with a flag on the command line to tweak the interval
  • Allow handling a signal (maybe allowing to pick a signal from the OS-supported signals, kill -l, without overriding the standard ones already handled by the JVM)
  • In the running tty, send a key combination, like ctrl+d
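A minimal sketch of the forced-exit variant using a JVM shutdown hook; covering arbitrary signals or key combinations would need platform-specific APIs:

// Sketch: dump the clusters when the process is terminated (e.g. ctrl+c).
// A shutdown hook covers normal termination and SIGINT/SIGTERM.
Runtime.getRuntime().addShutdownHook(new Thread(
        () -> drain.clusters().forEach(System.out::println)));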

Replace the JVM file watcher with alternatives that are not affected by filesystem boundaries

The watch service of the JVM suffers from a few drawbacks regarding its integration with the OS. On Linux in particular, events for bind mounts are not received.

Let's investigate alternatives, in particular the gradle native integration: https://github.com/gradle/native-platform

+    implementation("net.rubygrapefruit:file-events:0.22")
+    implementation("net.rubygrapefruit:native-platform:0.22")

File watching capabilities just appeared in a 0.22 milestone; unfortunately this is not completely released (platform-specific native libraries are not published on bintray for the published milestone).

To follow https://github.com/gradle/native-platform/releases

However native-platform:0.21 is available, and it's possible to play with some APIs like the terminal or files, e.g.:

try {
    Terminals terminals = Native.get(Terminals.class);
    var isTerminal = terminals.withAnsiOutput().isTerminal(Output.Stdout);

    if (isTerminal) {
        var terminal = terminals.withAnsiOutput().getTerminal(Output.Stdout);
        terminal.write("Hello");
        SECONDS.sleep(5);
        terminal.cursorStartOfLine()
                .clearToEndOfLine()
                .bold().write("Bold hello")
                .reset();
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}

Handle multiline log event (stack traces)

Currently the code is only able to process single line log messages. However it's possible to have multiline log messages.

Scope

In particular this ticket is about handling stacktraces, which usually start with whitespace. I am not familiar with stacktraces in other languages, so the goal of this ticket is to focus on Java stack traces that may appear in a log trail. A folding sketch follows the scope notes.

Out of scope

  • Multiline log messages not starting with whitespace
  • Stacktraces from languages other than Java
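A sketch of the intended grouping, under the assumption that continuation lines start with whitespace or "Caused by:":

import java.util.ArrayList;
import java.util.List;

// Sketch: fold stack-trace continuation lines into the preceding event
// so the miner sees one multiline message instead of many fragments.
static List<String> foldEvents(List<String> lines) {
    List<String> events = new ArrayList<>();
    for (String line : lines) {
        boolean continuation = !events.isEmpty()
                && (line.startsWith(" ") || line.startsWith("\t")
                    || line.startsWith("Caused by:"));
        if (continuation) {
            int last = events.size() - 1;
            events.set(last, events.get(last) + "\n" + line);
        } else {
            events.add(line);
        }
    }
    return events;
}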

Resolve `MappedFileLineReaderTest` on Windows and macOS builds

On Windows the build fails with

  • MappedFileLineReaderTest.find_start_position_given_last_lines()

    org.opentest4j.AssertionFailedError:
    expected: 42L
    but was : 43L

  • MappedFileLineReaderTest.can_read_from_position()

    org.opentest4j.AssertionFailedError:
    expected: 183L
    but was : 186L

  • MappedFileLineReaderTest.should_watch_with_channel_sink(Path)

java.io.IOException: Failed to delete temp directory C:\Users\RUNNER~1\AppData\Local\Temp\junit8439001560896197356. The following paths could not be deleted (see suppressed exceptions for details): , test4653189040998961269log

On macOS the build fails with

  • MappedFileLineReaderTest.should_watch_with_channel_sink(Path)

org.opentest4j.AssertionFailedError:
expected: 592L
but was : 38L

Document Drain-Java

Currently I only offered hints in the README, but I definitely need to spend some time on the documentation of:

  • Drain algorithm
  • Drain java API
  • Drain usage

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

This repository currently has no open or pending branches.

Detected dependencies

github-actions
.github/workflows/gradle.yml
  • actions/checkout v4
  • gradle/wrapper-validation-action v2
  • actions/checkout v4
  • actions/setup-java v4
  • gradle/actions v3
  • actions/upload-artifact v4
  • actions/download-artifact v4
  • mikepenz/action-junit-report v4
  • actions/checkout v4
  • actions/setup-java v4
  • gradle/actions v3
gradle
settings.gradle.kts
build.gradle.kts
drain-java-bom/build.gradle.kts
drain-java-core/build.gradle.kts
drain-java-jackson/build.gradle.kts
gradle/libs.versions.toml
  • com.google.code.findbugs:jsr305 3.0.2
  • info.picocli:picocli 4.7.5
  • info.picocli:picocli-codegen 4.7.5
  • com.fasterxml.jackson.core:jackson-core 2.17.0
  • com.fasterxml.jackson.core:jackson-annotations 2.17.0
  • com.fasterxml.jackson.core:jackson-databind 2.17.0
  • org.assertj:assertj-core 3.25.3
  • org.junit.jupiter:junit-jupiter-api 5.10.2
  • org.junit.jupiter:junit-jupiter-engine 5.10.2
  • de.undercouch.download 5.6.0
  • com.github.johnrengelman.shadow 8.1.1
  • com.github.ben-manes.versions 0.51.0
  • com.github.hierynomus.license 0.16.1
  • com.github.vlsi.gradle-extensions 1.90
  • nebula.release 19.0.6
tailer/build.gradle.kts
gradle-wrapper
gradle/wrapper/gradle-wrapper.properties
  • gradle 8.7


Add a mechanism to retain log event metadata such as the severity

Drain is a log mining algorithm; its idea is to find patterns and group similar log event messages.

The good practice is to make the miner process only the message part, i.e. strip elements like the date, the severity, the thread, and the logger name.

Yet it might be interesting to keep some of this information. For example, the severity or the logger name are unlikely to have a high cardinality, and may be good candidates as log cluster metadata. Consider these sample lines (a parsing sketch follows):

2021-03-29 12:55:24.172 [] DEBUG --- [  restartedMain] o.s.b.w.s.ServletContextInitializerBeans : Mapping filters: filterRegistrationBean urls=[/*] order=-2147483647, requestContextFilter urls=[/*] order=-1, contextServletRequestFilter urls=[/*] order=-2147483648, characterEncodingFilter urls=[/*] order=-2147483648, edgeRequestContextFilter urls=[/*] order=-2147483646, hideEdgeTechnicalEndpointsFilter urls=[/*] order=-2147483646, enableDebugLogsFilter urls=[/*] order=-2147483645, newrelicTransactionsFilter urls=[/*] order=-2147483645, accountingFilter urls=[/*] order=-2147483644, formContentFilter urls=[/*] order=-9900, disabledForwardedHeaderFilter urls=[/*] order=2147483647
2021-03-29 12:55:24.173 [] DEBUG --- [  restartedMain] o.s.b.w.s.ServletContextInitializerBeans : Mapping servlets: metricsService urls=[/metrics], dispatcherServlet urls=[/rest/*, /doc/*, /actuator/*, /error/*, /favicon.ico], com.blablacar.common.java.web.JerseyConfig urls=[/*]
2021-03-29 12:55:24.554 [] INFO  --- [  restartedMain] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 3 endpoint(s) beneath base path '/actuator'
2021-03-29 12:55:24.606 [] INFO  --- [  restartedMain] o.s.s.concurrent.ThreadPoolTaskExecutor  : Initializing ExecutorService 'applicationTaskExecutor'
2021-03-29 12:55:24.972 [] INFO  --- [  restartedMain] o.s.b.d.a.OptionalLiveReloadServer       : LiveReload server is running on port 35729
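A sketch of the extraction step, assuming the layout of the sample lines above; the regex and the Event record are illustrative, not drain-java API:

import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: split a line like the samples above into (severity, message),
// so the miner only sees the message while severity is kept as metadata.
record Event(String severity, String message) {}

static final Pattern LINE = Pattern.compile(
        "^\\S+ \\S+ \\[[^]]*\\] +(\\w+) +--- \\[[^]]*\\] \\S+ *: (.*)$");

static Optional<Event> parse(String line) {
    Matcher m = LINE.matcher(line);
    return m.matches() ? Optional.of(new Event(m.group(1), m.group(2)))
                       : Optional.empty();
}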
