Coder Social home page Coder Social logo

iohub / ahocorasick Goto Github PK

View Code? Open in Web Editor NEW
116.0 116.0 20.0 5.96 MB

A fast and memory efficient implementation of aho-corasick algorithm based on double-array trie (cedar), supports visualizing structure via graphviz.

License: GNU General Public License v2.0

Go 99.78% Makefile 0.22%
aho-corasick double-array-trie visualization

ahocorasick's Introduction

Hi robots 👋

IOhub's github stats

ahocorasick's People

Contributors

akfaew avatar iohub avatar ustack avatar zensig avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

ahocorasick's Issues

Failed with test from cloudflare/ahocorasick

func TestInterior(t *testing.T) {
	m := NewMatcher()
	for i, v := range []string{"Steel", "tee", "e"} {
		m.Insert([]byte(v), i)
	}
	m.Compile()

	seq := []byte("The Man Of Steel: Superman")
	mp := m.Match(seq)

	for _, item := range mp {
		fmt.Printf("item.Pos: %d item.OutID: %d\n", item.Pos, item.OutID)
	}
	req := m.ExportMItem(seq, mp)
	assert(t, len(req) == 3)

	for _, item := range req {
		fmt.Printf("key:%s value:%d\n", item.Key, item.Value.(int))
	}

	assert(t, req[0].Value.(int) == 2)
	assert(t, req[1].Value.(int) == 1)
	assert(t, req[2].Value.(int) == 0)
}

Output:

item.Pos: 2 item.OutID: 100
item.Pos: 15 item.OutID: 259
item.Pos: 21 item.OutID: 100
key:e value:2
key:Steel value:0
key:e value:2

How many word dictionary trees can ahocorasick build at most

I build with 4000W key-value,occur erro
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x5069d0, 0x16)
/home/work/local/go/src/runtime/panic.go:1116 +0x72
runtime.sysMap(0xc664000000, 0x400000000, 0x5f6438)
/home/work/local/go/src/runtime/mem_linux.go:169 +0xc6
runtime.(*mheap).sysAlloc(0x5dc060, 0x400000000, 0x429d17, 0x5dc068)
/home/work/local/go/src/runtime/malloc.go:727 +0x1e5
runtime.(*mheap).grow(0x5dc060, 0x200000, 0x0)
/home/work/local/go/src/runtime/mheap.go:1344 +0x85
runtime.(*mheap).allocSpan(0x5dc060, 0x200000, 0xc00c880100, 0x5f6448, 0xfffffffffffffade)
/home/work/local/go/src/runtime/mheap.go:1160 +0x6b6
runtime.(*mheap).alloc.func1()
/home/work/local/go/src/runtime/mheap.go:907 +0x65
runtime.(*mheap).alloc(0x5dc060, 0x200000, 0x450101, 0x5dc060)
/home/work/local/go/src/runtime/mheap.go:901 +0x85
runtime.largeAlloc(0x400000000, 0x5f0101, 0x7ff8a5b59080)
/home/work/local/go/src/runtime/malloc.go:1177 +0x92
runtime.mallocgc.func1()
/home/work/local/go/src/runtime/malloc.go:1071 +0x46
runtime.systemstack(0x0)
/home/work/local/go/src/runtime/asm_amd64.s:370 +0x66
runtime.mstart()
/home/work/local/go/src/runtime/proc.go:1116

Matching hangs forever when empty key is inserted

I accidentally had empty string in my test data and was not able to figure out why tests hang. Maybe consider ingoring such keys when inserting?

func TestAhoCorasick(t *testing.T) {
	matcher := ahocorasick.NewMatcher()
	matcher.Insert([]byte("adidas"), true)
	matcher.Insert([]byte(""), true)
	matcher.Compile()
	fmt.Println("compiled matcher")

	seq := []byte("adidas sneakers")
	resp := matcher.Match(seq)
	for resp.HasNext() {
		items := resp.NextMatchItem(seq)
		for _, itr := range items {
			fmt.Println("Matched", string(matcher.Key(seq, itr)))
		}
	}
	resp.Release()
}

Issues with concurrency

I'm getting transient errors when trying to create multiple automaton in different threads. I keep getting panic: runtime error: index out of range [-x] errors that pop up in different places while building the automatons.

When you state that "This package is not thread safe if there is one goroutine doing insertions or deletions.", does that mean even when each thread is accessing a different automaton?

bad results when insert order change

queryData := `xxxxxx`
kvs := map[string]ValutType{....}  // about 1w keys
m := NewMatcher()
for i:=0; i < 10; i++ {
  m := NewMatcher()
  for k, v := range kvs {
      m.Insert([]byte(key), v)
  }
  m.Compile()
  m.MatchResult() => Results.   // ERROR HERE, it will give different result in round i, map iterate order is not guaranteed in go 
}

could anyone help fix this. i'm not familiar with double-arrary trie

Insert followed by delete, repeat 1000 times, insert will have a loop problem

for h := 0; h < 1000; h++ {
		matcherFacade.Add("dgogo" + strconv.Itoa(h))
		matcherFacade.Remove("dgogo" + strconv.Itoa(h))
	}


// Add a word
func (m *MatcherFacade) Add(word string) {
	if word == constant.StringEmpty {
		return
	}
	defer m.recoverPanic("Add:" + word)

	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.isExist(word) {
		m.matcher.Insert([]byte(word), 0)
		m.matcher.Compile()
	}

}

// Remove a word 注意,加一个词后不能紧接着删这个词,vKey会有循环问题
func (m *MatcherFacade) Remove(word string) {
	defer m.recoverPanic("Remove:" + word)

	m.mu.Lock()
	defer m.mu.Unlock()
	if m.isExist(word) {
		m.matcher.Delete([]byte(word))
		m.matcher.Compile()
	}

}

The following code has a dead-loop problem:

func (da *Cedar) vKey() int {
	k := da.vkey
	for {
		k = (k + 1) % da.capacity
		if _, ok := da.vals[k]; !ok {
			break
		}
	}
	da.vkey = k
	return k
}

MItem's Freq property is always 0

Hi,

When you look at https://godoc.org/github.com/iohub/Ahocorasick#MItem the MItem should have a Freq property, which one would assume is the frequency of occurrence, but when you look at the code here:

https://github.com/iohub/Ahocorasick/blob/master/acmatcher.go

it actually only ever takes the default 0 value for undefined ints:

func (m *Matcher) mItemOf(seq []byte, offset, id int) []MItem {
	req := []MItem{}
	for e := &m.output[id]; e != nil; e = e.Link {
		nval := m.da.vals[e.vKey]
		len := nval.len
		val := nval.Value
		if len == 0 {
			continue
		}
		bs := seq[offset-len+1 : offset+1]
		req = append(req, MItem{Key: bs, Value: val})
	}
	return req
}

I could be wrong, I am new to Go, but it seems misleading.

Cheers!
M

fix match buffer oversize panic - memory usage

Hey, I have a question about the latest commit: bb845ee

Supposed we have a 1MB text that looks like this: "aaaaaaaa...aaa"
And keywords that look like this: ["a", "aa", "aaa", ... ]

Will the algorithm not take a whole lot of memory? There will be 1M matches of "a", 1M-1 matches of "aa" etc. This will probably exhaust all system memory but please correct me if I understood it wrong.

Wrong match result with large data

I run a simple test to count the number of occurrence for every keyword, but the result seems incorrect.
For the benchmark testdata, the number of keyword "abbe" should be 36, but I only got 10.

func testIohub(dict [][]byte, content []byte) {
	mem := new(runtime.MemStats)
	runtime.GC()
	runtime.ReadMemStats(mem)
	before := mem.TotalAlloc

	var m *iohub.Matcher

	func() {
		defer calcTime(time.Now(), "iohub/ahocorasick [build]")
		m = iohub.NewMatcher()
		for i, bs := range dict {
			m.Insert(bs, i)
		}
		m.Compile()
	}()

	runtime.GC()
	runtime.ReadMemStats(mem)
	after := mem.TotalAlloc
	fmt.Printf("iohub/ahocorasick [mem]\t\t %d KBytes\n", (after-before)/1024)

	func() {
		result := map[string]int{}
		defer func() {
			writeResult("result/iohub.txt", result)
		}()
		defer calcTime(time.Now(), "iohub/ahocorasick [match]")
		it := m.Match(content)
		for it.HasNext() {
			tokens := it.NextMatchItem(content)
			for _, t := range tokens {
				key := m.Key(content, t)
				result[string(key)]++
			}
		}

	}()
}

Loading a large number of strings_ index out of range

I am getting an error when searching against a large number of strings. It works with 5 million strings but fails from around 7 million strings..is there any limit to number of strings loaded and can it be adjusted...

The error is
Runtime error index out of range
Trace..cedar.go.line 320《- acmatcher.go line 138

Any suggestions to allow for more strings

Usage of redis.. question

just wondering if it would be possible to cache the tried data structure in redis with this package.. this will help in being able to load the strings one time and have multiple consumers doing a read only access.. If feasible where the major changes would be..

Appreciate any opinions on this..

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.