Coder Social home page Coder Social logo

pedroegsilva / gofindthem Goto Github PK

View Code? Open in Web Editor NEW
4.0 2.0 0.0 1.63 MB

Gofindthem is a go library that implements a domain-specific language(DSL), which is similar to a logical expression. It also uses a substring matching engine (implementations of Aho-Corasick) and a regex matching engine to enable more complex searches with easy-to-read expressions. It supports multiple expressions searches, making it an efficient way to "classify" documents according to the expressions that were matched.

License: MIT License

Starlark 4.54% Go 95.46%
parser go matching-engine regex-util dsl domain-specific-language logical-expression

gofindthem's Introduction

gofindthem

Gofindthem is a go library that implements a domain-specific language(DSL), which is similar to a logical expression. It also uses a substring matching engine (implementations of Aho-Corasick) and a regex matching engine to enable more complex searches with easy-to-read expressions. It supports multiple expressions searches, making it an efficient way to "classify" documents according to the expressions that were matched.

Brief History

This project was conceived in 2019 while I was searching for a way to match multiple rules on millions of documents in a more efficient way than matching multiple regex.

In my researches, I found this article written by Vikash Singh. He described a problem similar to mine, where the use of a set of regex needed to be applied on many documents, and this was quite slow. In his case, he needed to replace the found terms for another, and I needed to use the found terms to check if a set of rules were met.

In the Vikash Singh article I learned about the Aho-Corasick algorithm, which solved the "matching of multiple terms efficiently", but it didn't solve my problem completely. The problem was that the rules that were applied needed to be managed by a team of analysts, so I couldn't just hard code the set of rules since I would have to change them constantly. The other problem was that the team used to use regex to classify those documents. Because of that, I needed a syntax that was easy enough to convince them to change.

The idea was to create a DSL that would have the operators "AND", "OR" and "NOT", which are the same as the logical operations, and a new operator "INORD", that would check if the terms were found in the same order that they were specified. This would allow searches that used the regex foo.*bar to be replaced with INORD("foo" and "bar") and the combination of regex foo.*bar and bar.*foo to be replaced as "foo" and "bar". This does not replace all regex, but it would cover most use cases that I had back then. For those cases that only regex would solve, I added a way to represent a regex with the syntax R"foo.*bar", making each kind of term use its respective engine to find its matches and reducing the need to use regex for everything.

This repository is the golang implementation of this idea.

The scanner and parser from the DSL are heavily influenced by this article written by Ben Johnson, from which is heavily influenced by the InfluxQL parser.

Usage/Examples

There are 3 libraries/packages on this repository, the DSL, the Finder and the GroupFinder.

Finder

The finder is used to manage multiple expressions. It will use the DSL to extract the terms and regex from each expression and use them to process the text with the appropriate engine.

You will need to create the Finder object. The Finder needs a SubstringEngine (interface can be found at /finder/substringEngine.go) and a RegexEngine (interface can be found at /finder/regexEngine.go) and if the search will be case sensitive or not. There are 3 "implementations" of SubstringEngine that uses the libraries from https://github.com/cloudflare/ahocorasick, https://github.com/anknown/ahocorasick and https://github.com/pedroegsilva/ahocorasick (fork from cloudflare with the addition for matching positions) and a regexp implementation for the RegexEngine. But any other library can be used as long as it "implements" the SubstringEngine or RegexEngine interface.

    subEng := &finder.CloudflareForkEngine{}
    rgxEng := &finder.RegexpEngine{}
    caseSensitive := true
    findthem := finder.NewFinder(subEng, rgxEng, caseSensitive)

Then you need to add the expressions that need to be solved and a tag for that expression.

	if err := findthem.AddExpressionWithTag(`r"Lorem" and "ipsum"`, "test"); err != nil {
		log.Fatal(err)
	}

	if err := findthem.AddExpressionWithTag(`("Nullam" and not "volutpat")`, "test2"); err != nil {
		log.Fatal(err)
	}

	if err := findthem.AddExpressionWithTag(`"lorem ipsum" AND ("dolor" or "accumsan")`, "test"); err != nil {
		log.Fatal(err)
	}

	if err := findthem.AddExpression(`"purus.\nSuspendisse"`); err != nil {
		log.Fatal(err)
	}

	if err := findthem.AddExpression(`inord("Lorem" and "FOO")`); err != nil {
		log.Fatal(err)
	}

And finally you can check which expressions were match on each text.

	for i, text := range texts {
		resp, err := findthem.ProcessText(text)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("----------Text %d case sensitive-----------\n", i)
		for _, expRes := range resp {
			fmt.Printf("exp %d: [%s]%s\n", expRes.ExpresionIndex, expRes.Tag, expRes.ExpresionStr)
		}
	}

The full example can be found at /examples/finder/main.go

GroupFinder

The Group finder is a package that adds another DSL to improve the maintainability of the searched patterns and enables searches on specific fields of structured documents. It allows the configuration to be split into 2 categories(rules and tags) so that the tags can be used by multiple rules. The Rules also enables to check if a given tag was found on a specific field for a structured document. You can find more about the usage of the Group Finder at its README

DSL

Definition

The DSL uses 5 operators (AND, OR, NOT, R, INORD), terms (defined by "") and parentheses to form expressions. A valid expression can be:

  • A single term. Eg: "some term"
  • The result of an operation. R"term 1"
  • An expression enclosed by parentheses ("term 1" AND "term 2")

Each operator functions as the following:

  • AND - Uses the expression before and after it to solve them as a logical AND operator.

    (valid expression) AND (valid expression) eg: "term 1" AND "term 2"

  • OR - Uses the expression before and after it to solve them as a logical OR operator.

    <valid expression> OR <valid expression> eg: "term 1" OR "term 2"

  • NOT - Uses the expression after it to solve them as a logical NOT operator.

    NOT <valid expression> eg: NOT "term 1"

  • R - Defines the next term as a regex.

    WARNING - To scape regex operator characters please use \\ instead of \ since the single reverse bar is used to scape the char " on the DSL.

    R <term> eg: R"term1\\."

  • INORD - Needs to be followed by an expression enclosed in parentheses. This operator will check if there is a set of terms on the document that satisfy the same order of the enclosed terms. It will still solve the logical expressions but it will return false if the terms were not found in the defined order. Note that the OR operator enclosed on the INORD operator will consider that at least one of the terms must be found and in order. For example INORD("a" and "b") or INORD("a" and "c") is equivalent to INORD("a" and ("b" or "c"))

    WARNING - The NOT and INORD operators are not permitted on the expression that is enclosed by the INORD operator. Because the expression INORD(NOT "a" and "b") doesn't make sense and another INORD would be redundant.

    INORD(<valid expression>) eg: INORD("term 1" AND ("term 2" or "term 3"))

Package

To use this package as a stand-alone, you will need to create the parser object. The parser needs a reader with the expression that will be parsed and if it will be case-sensitive.

    caseSensitive := false
    p := dsl.NewParser(strings.NewReader(`INORD("foo" and "bar" and ("dolor" or "accumsan"))`), caseSensitive)

Then you can parse the expression

    expression, err := p.Parse()
    if err != nil {
        log.Fatal(err)
    }

Once parsed you can extract which terms there were on the expression.

    keywords := p.GetKeywords()
    fmt.Printf("keywords:\n%v\n", keywords)

And which Regexes there were on the expression.

    regexes := p.GetRegexes()
    fmt.Printf("regexes:\n%v\n", regexes)

You can also get a pretty format to see the created Abstract Syntax Tree (AST).

    fmt.Printf("pretty format:\n%s\n", expression.PrettyFormat())

To solve the expression we need a map of the terms that were matched and the position of each match. The positions are needed to solve the INORD operator and must be sorted to work properly, if there is no INORD operator in the expression, an empty array or a nil value will suffice.

solving example:

	matches := map[string][]int{
		"foo":   {0, 2, 5},
		"bar":   {3},
		"dolor": {1, 7},
	}

	response, err := expression.Solve(matches)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("eval ", response)

The complete example can be found at /examples/dsl/main.go

Run Locally

This project uses Bazel to build and test the code. You can run this project using go as well.

What is Bazel?

"Bazel is an open-source build and test tool similar to Make, Maven, and Gradle. It uses a human-readable, high-level build language. Bazel supports projects in multiple languages and builds outputs for multiple platforms. Bazel supports large codebases across multiple repositories and large numbers of users." - from https://docs.bazel.build/versions/4.2.0/bazel-overview.html.

To install Bazel go to https://docs.bazel.build/versions/main/install.html.

Now with Bazel installed.

Clone the project.

  git clone https://github.com/pedroegsilva/gofindthem.git

Go to the project directory

  cd gofindthem

You can run the examples with the following commands.

  bazel run //examples/finder:finder
  bazel run //examples/dsl:dsl

To run all the tests use the following.

bazel test //...

To run all the benchmarks use the following.

bazel run //benchmarks:benchmarks_test -- -test.bench=. -test.benchmem

gofindthem's People

Contributors

pedroegsilva avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.