timbray / quamina Goto Github PK

View Code? Open in Web Editor NEW

376.0 376.0 18.0 43.01 MB

Home of Quamina, a fast pattern-matching library in Go

License: Apache License 2.0

Go 100.00%

event-driven events go golang json

quamina's People

Contributors

Stargazers

Watchers

Forkers

jsmorph devdoshi goodhut embano1 quaeler gabrielreid tuananh jhawk28 thachnv92 thailx kibalion2427 isgasho rudygt kylemcc yosiat luis-pinto-fanduel

quamina's Issues

Add repository badges

For newcomers to the repo/project it's nice to see a repo status and have quick links, e.g. to go docs, releases, etc. For an example, see here: https://github.com/embano1/ci-demo-app#project-status-

Depends on #5 and #6

MatchesForJSONEvent isn't thread-safe

The unit tests cover M4JE vs AddPattern reasonably well, but not M4JE vs M4JE. It won't work, because currently the Matcher has a "flattener" field and the first step in matching is to call m.flattener.reset(), which zeroes out all the fields, because these things are reasonably expensive to construct. Each matching thread needs to have its own flattener. So that field needs to be ripped out of Matcher and the API needs to be enriched to fetch a Flattener to use on events before trying to match them.

The alternative would be making the Flatten method single-threaded but that would be terrible, it should parallelize beautifully assuming each thread has its own flattener.

`Flattener` interface `reset` not exported

Is this a typo or intentional?

quamina/core/flattener.go

Line 38 in 875c269

reset()

Add `GET` and `LIST` APIs for retrieving registered patterns

Currently in quamina when using the default matcher, i.e. quamina.New() there is no way to retrieve the registered patterns, e.g. for debugging, inspection, pattern handling (review/replace during runtime, etc.).

The following APIs would be useful (just a proposal, need to figure return types):

func (q *Quamina) GetPattern(x X) (string, error) (note: assumes pattern is of type string)
func (q *Quamina) ListPatterns() ([]string, error) (note: assumes pattern is of type string)

pat: Implement prefix pattern

Note that the following are equivalent

{ "a": [ {"shellstyle": "foo*"} ] }

{"a": [ {"prefix": "foo"} ] }

It's already special-cased in makeShellStyleAutomaton. So there are a variety of ways this could be done, ranging from just re-writing the pattern to fixing up the switch on ValueType in AddPattern. But no new automaton construction should be required.

chore: docstrings need work

There are incompletely described exposed methods and fields, and some bad grammar and broken sentences.

Matcher creation function name

In core, we have NewCoreMatcher() but in Pruner we have pruner.NewMatcher(). These should be consistent with each other.

Patterns support - `IN` and array overlapping

Hi!

Thanks for this beautiful amazing library 🚀 I have built internally a simple pattern matching library (using inverted indices and simple counting) which lacks some features that Quamina have and I want to explore Quamina to see if I can replace my own library in terms of correctness & performance.

To do that I need to support 4 operators which two of them already exists (Equality, Exists) but IN and array overlapping doesn't.

Examples:

IN

Event:

{
    "event": "machine:created"
}

Pattern:

{
     "event": ["machine:created", "machine:deleted", "machine:updated"]
}

So any event which is one of the above will yield a match

Array overlapping

Event:

{
    "customers": ["a", "b", "c", "f"]
}

Pattern:

{
     "customers": ["c", "d", "a", "y"]
}

In this case we will have a match since we have overlapping (with c, d and a)

Is it possible to support those operators? I am still reading the code and trying to figure that out on my own.. If it's possible to support and some guidance will be given I will try to implement this on my own.

Update README for release 0.1.0

Add LFS support to actions

Add example for godoc

Create testable example which also shows up in godoc/https://pkg.go.dev/

code: Clean up atomic.Value "updateable" handling

each of coreMatcher, fieldMatcher, and valueMatcher have an "updateable" field that is an atomic.Value used to preserve thread-safety while updates are in progress. But the naming of variables and functions isn't consistent or sensible.

Review exposed types and methods

After the restructuring IMHO it would make sense to review all exposed types, vars, functions and methods to verify:

a) they have to be exposed (you know, part of the API, breaking changes, etc.)
b) all exposed objects have proper code comments/follow Go conventions regarding naming

Add Release Automation

Trigger Create Github Release on push of v* tag
Create release notes (using git-chglog

Enable Github Actions

As per a discussion with @timbray it would be nice to leverage Github Actions for same basic automation around unit tests, PR reviews, etc. Later this can be extended for release automation, package management (if needed), and so on.

api: Implement anything-but pattern

(compatible with EventBridge)

Add repo to awesome-go

awesome-go drives a lot of interest and views, it‘s generally great to be listed there.
Requirements: https://github.com/avelino/awesome-go/blob/main/CONTRIBUTING.md

IMHO the only thing missing is code coverage which we can easily enable through the CI automation (already in place).

shellstyle * is broken

Probably need to switch to an NFA implementation to make it correct.

Support for custom functions for pattern checking

Quamina already seems to support the use of some functions to define whether an event can be deemed to match a pattern, such as the "exists" or "shellstyle" functions.

I think that we were to make this user-definable, then it will be a significant addition to the capabilities of quamina in different scenarios, where the native pattern matching does not support that kind of functionality.
A simple example is greater-than,less-than etc. (which of course is trivial to implement for integers, but then those operators can mean different things in different application settings).

Another simple example of such a user-defined function could be "is-ip-address", "is-part-of-subnet" etc.

Using such a user-defined function, i am imagining a pattern as follows

{"Image": { "Width": [ { "is_greater_than": 200 },  { "is_less_than": 1000 } ] } }

I have not yet tested, if Quamina supports multiple such checks, but if they are not supported yet - that could be a different enhancement.

Quamina could support a new API like registerPatternOperator which will take a keyword ("is_greater_than" in the above example), and a function implementation (subject to the required function signature). In the first iteration, such a function could return a simple boolean (or any additional information that you might need). Based on the response, the core quamina pattern matching engine can deem this sub-pattern to have matched or not.

Make the flattener smarter about skipping fields

Continuing the discussion from this issue: #108

Currently the flattener is fast and performs well on cases where the searched fields (from the patterns) are covering most of the event payload, in cases of large payloads and the searched fields are small subset of it, the flattener can be improved.

Proposed Solution and POC results

My proposed solution consists of maintaining paths index being in use (as a tree data structure) and POC flattener written using Jx.

The source resides here - https://github.com/yosiat/quamina/tree/flat-hack inside "paths.go" and "jx.go", it passes most of the existing flattener tests except the error handling.

Benchmark results:

goos: darwin
goarch: arm64
pkg: github.com/yosiat/quamina-flatenner
Benchmark_Quamina_JxFlattener-10                 2612365               437.9 ns/op            24 B/op          3 allocs/op
Benchmark_Quamina_Flattener-10                     83246             14347 ns/op             360 B/op         23 allocs/op
Benchmark_LargePayload_QuaminaFlattner-10           78272             15307 ns/op            1240 B/op         44 allocs/op
Benchmark_LargePayload_JxFlattner-10               876013              1365 ns/op             904 B/op         24 allocs/op
PASS
ok      github.com/yosiat/quamina-flatenner     5.727s

source: #108 (comment)
Those benchmarks can't be shared currently since they are using an internal data, but can be adjusted and shared if needed.

BigShellStyle benchmark (in benchmarks_test.go) -

# Quamina
Field matchers: 27 (avg size 1.000, max 1), Value matchers: 1, SmallTables 54 (avg size 15.500, max 28), singletons 0
428,547.72 matches/second with letter patterns

# Jx
Field matchers: 27 (avg size 1.000, max 1), Value matchers: 1, SmallTables 54 (avg size 15.500, max 28), singletons 0
1,367,947.02 matches/second with letter patterns

And in CityLots benchmarks where we are pattern matching against the whole event data, we see there is no difference -

=== RUN   TestCityLots
Field matchers: 7 (avg size 1.500, max 3), Value matchers: 6, SmallTables 0 (avg size n/a, max 0), singletons 6

173,434.09 matches/second

--- PASS: TestCityLots (1.64s)
=== RUN   TestCityLots_JX
Field matchers: 7 (avg size 1.500, max 3), Value matchers: 6, SmallTables 0 (avg size n/a, max 0), singletons 6

177,609.63 matches/second

Open Questions & Notes

As discussed we won't use Jx in Quamina, since we don't want to use an external dependencies - it's being used only for POC purposes, and once we have a general idea on how to make skipping smarter & faster in large payloads we will implement it in existing flattener.
Current discussion and open question is how to pass the information about needed paths to pluck.

Organization of tests

We have a lot of tests and some of them really aren’t “unit tests” but focused on coverage and concurrency, etc. Speaking as a developer, here are the things I want to do.

Run unit tests in the directory I'm working in, the equivalent of go test.
Run concurrency tests across all the directories. This is important since thread-safety is an important and advertised feature.
Coverage tests, to support our coverage badge
Linting tests
Acceptance tests, including all of the above, when you think you’re ready to do a PR

I suppose the main activities would be (1), (2), and (3,4,5)

Notes.

For (1) we absolutely need to run in <10sec, so that a developer can run them every time they save a tiny change
(1) should include one benchmark test to catch performance regressions, which can easily creep in even on a minor change. I think TestCityLots is a good choice since it explores worst-case performance of the Flattener and Matcher.
The -race flag really slows things down, in particular the concurrency tests. I don't think it's really needed in case (1)?

If it were 20 years ago, I think I’d say “we need a Makefile”. Um… I think we need a Makefile?

Formalize automaton processing

There is quite a bit of complexity in supporting both NFAs and DFAs. This is not strictly necessary since there are well-known algorithms for converting NFAs to DFAs. This might make AddPattern more expensive, but this should be investigated and benchmarked.

Similarly, anything_but.go is complex and hard to understand. Also perhaps not necessary because there are well-known arguments for taking the complement of a DFA and for doing intersections (I think?) differences. Similarly, the AddPattern cost should be investigated.

go.mod should define the module name with complete github.com path

It seems that the way the module name is defined in go.mod, it is not possible to easily use this package from other golang projects.
Is there reason, that the module name is not defined as follows?

module github.com/timbray/quamina

go 1.18

Alternatively, if there is an easy way to use this package without this change, then could it be added to the README?

Add Github repo description and tags for easy discovery

This needs to be done by repo owner

Add context.Context arg to Matcher methods?

Should Matcher interface methods take context.Contexts as their first arguments?

Rationale: Some underlying implementations (e.g., Pruner with State that does I/O) might want to have a context (for either timeouts/deadlines or context values), and some fancy matchers that do async/background work might want a context for timeouts in selects.

docs: README doesn't discuss error-handling in machine APIs

This has just been changed with the recent flatten_json commit and the description of error-handling should be updated, but there is no description.

Remove LFS stuff & citylots.jlines

The huge file testdata/citylots.jlines is no longer needed, tests have been updated to use the compressed form. Will remove before too long. A lot of the CI stuff still has LFS enabled. Is having that in there a problem, i.e. should we take it out?

Linter broken due to wrong go version

Improve test coverage

Coverage is only 83% overall, not good enough. The main culprit is lack of tests that check error returns, especially in fj.go, which is a huge file and doesn't have tests for any of the very large number of possible syntax errors. Also numbers.go is very weak (even though it is not actually in production yet).

Create Protobufs flattener

Allow Quamina to filter protobufs message traffic

bug: adding shellstyle patterns can burn time and memory

If you add a sequence of random shellstyle patterns Quamina quickly starts generating huge automata and AddPattern starts slowing down dramatically to the point where it's basically unusable.

My test words that produced this effect: "aahed*", "aalii", "aargh", "aarti*", "abaca", "abaci", "aback", "abacs", "abaft", "abaka", "abamp", "aband", "abase", "abash", "abask", "abate", "abaya", "abbas", "abbed*", "ab*bes"

broken badges

The "go report" and "Go v1.18" badges are broken.

Event with empty array rejected?

I'm getting the error "at line 1 col 6: illegal character ] in array" when asking MatchesForEvent with an event that has a property value that's a zero-length array. Seems like [] should be a legal value.

Example in the playground:

func main() {
	event := `{"a":[]}`
	q, _ := quamina.New()
	if _, err := q.MatchesForEvent([]byte(event)); err != nil {
		log.Fatal(err)
	}
}

Output:

2009/11/10 23:00:00 at line 1 col 6: illegal character ] in array

I'll try to dig in.

Consider using RE2 (aka golang `regexp`) for wildcard matches

You might be interested to know that the Go regexp library uses finite automata already to implement the regexp package. From that package:

The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. ... For more information about this property, see

https://swtch.com/~rsc/regexp/regexp1.html

It should be trivial(1) to replace the current shellstyle matcher with a translation to the regexp module, and to check for performance differences.

(1) exercise left to the reader

The one place where shellstyle might be faster than regexp is if you require start and end matching and allow only one glob in the middle. At that point, shellstyle becomes a shortcut for startswith("stuff before *") AND endswith("stuff after *") AND len(string) > len(expression).

Implement I-Regexp

The IETF JSONPath WG is building a minimal-ish regular expression specification called I-Regexp designed to be interoperable across a wide variety of regex implementations. See https://www.ietf.org/id/draft-ietf-jsonpath-iregexp-00.html

A full or even subset implementation of I-Regexp would be a valuable addition to Quamina.

Enable codecov.io

Seems this negativity affects the speed of unit tests @timbray

time go test -race -v -count 1 -coverprofile=coverage.txt -covermode=atomic ./...

takes almost 10m now on my machine, CI much slower breaking unit tests.

SQL-like Expression Language for patterns

In a discussion with a colleague, were I was showcasing quamina, he asked whether the quamina authors considered SQL(-like) expressions instead of JSON patterns, e.g. using CloudEvents SQL Expression Language.

Example:

// input event
{
    "source": "commerce.mysource.example.com",
    "type": "order.created",
    "id": 123
}

// quamina pattern to match on source prefix and type set
{
    "source": [{"shellstyle": "commerce*"}],
    "type": ["order.created", "order.updated", "order.canceled"]
}

// equivalent cesql expression
cesql: "source LIKE 'commerce*' AND type IN ('order.created', 'order.updated', 'order.canceled')"

Without reading the quamina pattern match specification, SQL-like expressions generally are more readable and explicit, e.g. AND. Also, SQL is a widely used and accepted standard, lowering the bar for adopters.

Note that a SQL-like dialect does not have to replace the existing JSON pattern match implementation, but could augment it.

bug: exists:false has issues

Check out closed PR #104

Significant redesign of "exists": false required.

matchSet.addX: hotspot

Hi,

I am doing some performance tests to see If I can use this library instead something I built.

On most of my benchmarks Quamina is faster, there are two main spots which caused terrible degradation:

Flattening - I am passing large objects (7kb~ in size, with nesting) and flattening kills the performance, but since I know the list of fields (& paths) being searched on, I can reduce the incoming event to a smaller one probably.
matchSet.addX - currently it's duplicating the matchSet (even if there are no new matches), this is causing a major degradation.

I did change locally of addX to update in place instead of creating new matchSet, and this is the perf difference I see:

Before:

Benchmark_Quamina-10               27054             49177 ns/op           43311 B/op        160 allocs/op

After:

Benchmark_Quamina-10              221080              6717 ns/op            8932 B/op         28 allocs/op

From what I see, it's done for concurrency concern, but looking at -

quamina/core_matcher.go

Line 151 in 4477484

func (m *coreMatcher) matchesForSortedFields(fields []Field) *matchSet {

it looks like there is no concurrent access.

And maybe simple RWMutex can help us instead of duplication?

I can submit a PR to fix this, but trying to understand what are the option we have in hand.

chore: cleanup for initial release

A few odds and ends so we can do an 0.5 or whatever

add docstrings for the Flattener-related types, e.g. ArrayPos
unexpose MemState

code: Adopt AWS Event Ruler unit tests

There is a huge amount of accumulated wisdom in https://github.com/aws/event-ruler/tree/main/src/test/software/amazon/event/ruler as a consequence of running at millions of invocations / second for years. We should cherry-pick a few. I bet this will turn up deep horrifying bugs in Quamina and learn useful things.

Create CBOR flattener

Allow Quamina to filter CBOR messages.

pat: Fix numeric matching

Quamina doesn't know that "30", "3.0e1", and "30.000" are the same number. We already have code to canonicalize numbers with up to 18 digits of precision in the range +/- 10**9 so they can be matched by a DFA, but canonicalizing is runtime-expensive and if you know that all the numbers in your events are integers, purely wasteful.

I think the best way to fix this is to introduce a "numeric" predicate into patterns:

{ "max": [ {"numeric": [ "=", 30] } ] }

But there might be a case for asking at the Match* API level to request canonicalizing numbers in incoming events.

`README` does not describe/link to `CONTRIBUTING.md`

Clarify supported pattern matchers

From the README the following is not clear IMHO:

Which patterns are supported/implemented?
That this package follows EventBridge pattern matching, i.e. highlight as feature in README to reuse EB patterns?

Having a table which maps existing patterns, e.g. from EventBridge, to supported/unsupported in the README would be nice, e.g.:

Pattern	Suppported	Notes
`exists`	Yes
`shellstyle`	Yes*	Only single wildcard `*` supported
`number`	Yes*	Number matching is weak

Thoughts?

api: Relationship between Flatteners and Matchers

I'm worried about getting into one of those hideous Go circular-dependency traps. Let me describe the situation in free text in hopes of clearing up my own thinking and provoking suggestions on the right way forward.

After a Matcher has been loaded with Patterns, every Event has to be flattened before it can be used with a Matcher call to find which patterns match it.

Both Matcher and Flattener are interfaces, both currently defined in core/. We already have two implementations of Matcher (in core/ and pruner/) but only one implementation of Flattener (core/FJ). I think it is very likely that we will have multiple flatteners in future - for Avro, Protobufs, CBOR, etc etc.

In theory, Matchers and Flatteners don't need to know about each other. A Flattener gets a []byte message in some form and emits a []*Field. Hmm, just realized they both need to know about Field.

A Flattener does need to know which Field names are used in patterns, so it can leave the rest out of the []Field output - this is a really important optimization. In practice this has to be provided by a Matcher because it knows what fields are used as a result of all the AddPattern() calls. There is an interface called NameTracker with one method, IsNameUsed(string name), which in theory should be used by FJ but at the moment it isn't, NewFJ() is given a CoreMatcher, which it uses both for the IsNameUsed call and for the FlattenAndMatch convenience function. So that interface needs to be either used or discarded.

I want to make it easy and straightforward to add a new Flattener for a new data type and now I'm thinking of how to organize things to achieve that. Conceivably the new Flattener shouldn't even have to be part of this repo. Two approaches occur to me:

Leave everything in core/ - minimum hassle and risk of circular dependencies. But, is it still straightforward to implement a new Flattener?
Make new subdirectories for at least Matcher, Flattener, Field. You wouldn't get a convenience function like flattener.FlattenAndMatch() but everything would be maximally clean. I'd lose sleep about wandering into a circular dependency.

Anyhow, at least I've written the problem down now. Input at any level from comprehensive suggestions to "at least don't do X" would be welcome.