Basically, it's a set of strings that supports membership testing, spelling correction (fuzzy matching), and autocomplete, with better memory efficiency than a regular trie.
Can you please elaborate on how this can be used for fuzzy matching and autocomplete?
As far as I can tell, you can only test for strict presence of a full string. It's not possible to check if a substring or a fuzzy match exists in the set, is it?
Thanks for making this. Can you please share the 8M strings of testing data you used for this? This will give a good point of comparison with other libs out there.
The encoding currently only supports ASCII (oops - forgot to upgrade that), since each character gets only 1 byte. This should be expanded to multiple bytes when the user wants to encode trees containing Unicode characters.
However, since this greatly increases the file size and may not be necessary, this option should be configurable.
Please correct me if I'm wrong. I don't think there's a standard interface for Encoder/Decoder types, but as far as I can see, they typically take an io.Reader to create a Decoder and an io.Writer to create an Encoder.
To shrink the file even further, the pointer size should be adjusted based on how many items are in the tree being encoded. Right now we simply use 4 bytes, but small trees don't need that many: 4 bytes can address over 4 billion items. The pointer size should be automatically adjusted or made configurable.