antzucaro / matchr
An approximate string matching library for the Go programming language.
License: Other
I don't know if you're interested, but there's a much easier way of doing Smith-Waterman: add an extra row and an extra column at index 0 whose scores are all zero. With that padding, every cell has valid neighbours and you only need a single body loop:
https://github.com/plsql/jh-bio/blob/unique-kmers/bioutils/alignment.go#L63
For example, comparing "heh" and "hhh" will use the below matrix:
0.0 0.0 0.0 0.0
0.0 1.0 1.0 1.0
0.0 0.5 0.5 0.5
0.0 1.0 1.5 1.5
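The padded-matrix idea can be sketched in a few lines of Go. This is a minimal illustration, not the code from either repository: the function name scoreMatrix and the scores (match +1, mismatch -2, gap -0.5) are assumptions, chosen because they reproduce the matrix above.

```go
package main

import "fmt"

// scoreMatrix fills a Smith-Waterman matrix that has an extra all-zero
// row and column at index 0, so every cell can read its upper, left and
// diagonal neighbours without any boundary checks.
// The name and the scoring parameters are hypothetical.
func scoreMatrix(a, b string, match, mismatch, gap float64) [][]float64 {
	ra, rb := []rune(a), []rune(b)
	h := make([][]float64, len(ra)+1)
	for i := range h {
		h[i] = make([]float64, len(rb)+1) // zero-initialised by Go
	}
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			sub := mismatch
			if ra[i-1] == rb[j-1] {
				sub = match
			}
			best := 0.0 // a local alignment score never drops below zero
			for _, c := range []float64{
				h[i-1][j-1] + sub, // align a[i-1] with b[j-1]
				h[i-1][j] + gap,   // gap in b
				h[i][j-1] + gap,   // gap in a
			} {
				if c > best {
					best = c
				}
			}
			h[i][j] = best
		}
	}
	return h
}

func main() {
	// Assumed scores: match +1, mismatch -2, gap -0.5.
	for _, row := range scoreMatrix("heh", "hhh", 1, -2, -0.5) {
		fmt.Println(row)
	}
}
```

The best local alignment score is the largest value in the matrix, 1.5 in this example.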
Hi there! I was wondering why there is a hardcoded max length of 4 for double metaphone? (I've also noticed other implementations limit this to 4). Is there a particular design decision behind this?
Have you chosen a license for this project? I'd love to use your Smith-Waterman implementation in a GPL-licensed bioinformatics project of mine:
https://github.com/plsql/jh-bio
Let me know.
Thanks!
In util.go, the function charAt is never used and can be removed completely. However, if you intend to use it, you may want to change it.
Currently, if the provided index is out of bounds, the function returns 0. However, Go, unlike C, allows null characters (\u0000) in strings, so it is impossible to distinguish between an out-of-bounds index and a null character.
You could fix this by returning -1 instead, since the rune type is just an alias for int32.
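A minimal sketch of that suggestion follows. The signature is an assumption (the real charAt in util.go may take different parameters); only the -1 sentinel comes from the issue.

```go
package main

import "fmt"

// charAt returns the rune at index i, or -1 when i is out of bounds.
// The signature is hypothetical; -1 is safe as a sentinel because rune
// is an alias for int32 and no valid code point is negative.
func charAt(s []rune, i int) rune {
	if i < 0 || i >= len(s) {
		return -1
	}
	return s[i]
}

func main() {
	s := []rune("a\x00")
	fmt.Println(charAt(s, 1)) // 0: a real null character
	fmt.Println(charAt(s, 2)) // -1: index out of bounds
}
```

Returning a second bool value, as in (rune, bool), would be another common Go way to signal the out-of-bounds case.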
Thanks for your great work!
In my opinion, BMPM2 is by far the best phonetic matcher on earth:
https://stevemorse.org/phonetics/bmpm2.htm
It is well organized and easy to implement:
https://stevemorse.org/phoneticinfo.htm
Please have a look and consider porting it from PHP to Go.
If the NYSIIS function receives a string containing only numbers, or only numbers and symbols, it panics. The function should probably return an empty string ("") instead. Line 24 triggers the panic because the input is empty at that point if the string is numbers or symbols only.
I have a fix with tests ready in a local branch if you want it. Alternatively, you can add this code right around line 24:
	// if no characters are left, return blank
	if len(input) == 0 {
		return ""
	}
Test cases:
{"2002", ""},
{"1/2", ""},
{"", ""},
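To show why the guard matters, here is a small self-contained sketch. It is not the NYSIIS implementation: lettersOnly and firstLetter are hypothetical stand-ins for a cleaning step followed by code that reads the first character of the cleaned input, which is the kind of access that fails on an empty string.

```go
package main

import (
	"fmt"
	"strings"
)

// lettersOnly is a hypothetical cleaning step: it upper-cases the input
// and keeps only the letters A-Z.
func lettersOnly(s string) string {
	var b strings.Builder
	for _, r := range strings.ToUpper(s) {
		if r >= 'A' && r <= 'Z' {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// firstLetter stands in for an encoder that reads the start of the
// cleaned input. Without the length check, slicing an empty string
// here would panic.
func firstLetter(s string) string {
	input := lettersOnly(s)
	// if no characters are left, return blank
	if len(input) == 0 {
		return ""
	}
	return input[:1]
}

func main() {
	for _, in := range []string{"2002", "1/2", "", "a1"} {
		fmt.Printf("%q -> %q\n", in, firstLetter(in))
	}
}
```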
matchr.JaroWinkler("dr", "driveway", true)
--
panic: runtime error: index out of range
github.com/antzucaro/matchr.jaroWinklerBase(0xc82056cf71, 0x2, 0xc82056cfd1, 0x8, 0x101, 0x3fe8000000000000)
/home/vagrant/workspace/go/src/github.com/antzucaro/matchr/jarowinkler.go:100 +0x59d
github.com/antzucaro/matchr.JaroWinkler(0xc82056cf71, 0x2, 0xc82056cfd1, 0x8, 0xb69801, 0xc8202a8418)
/home/vagrant/workspace/go/src/github.com/antzucaro/matchr/jarowinkler.go:134 +0x50
As a hotfix, I changed line 100 to explicitly check that r1 and r2 both have an element at index i:
for i = 0; i < j && len(r1) > i && len(r2) > i && r1[i] == r2[i] && nan(r1[i]); i++ {
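The same bounds check can be isolated in a small helper. This is only a sketch under assumptions: commonPrefixLen is a hypothetical name, and notDigit is a guess at what the nan helper in the loop above checks (that a rune is not a digit).

```go
package main

import "fmt"

// notDigit is an assumed reading of the nan helper used in the loop
// above: it reports whether r is not an ASCII digit.
func notDigit(r rune) bool { return r < '0' || r > '9' }

// commonPrefixLen counts the leading runes that r1 and r2 share, up to
// limit. Clamping limit to both lengths first means the loop can never
// index past the end of the shorter input.
func commonPrefixLen(r1, r2 []rune, limit int) int {
	if len(r1) < limit {
		limit = len(r1)
	}
	if len(r2) < limit {
		limit = len(r2)
	}
	i := 0
	for i < limit && r1[i] == r2[i] && notDigit(r1[i]) {
		i++
	}
	return i
}

func main() {
	// The inputs from the panic report; prints 2.
	fmt.Println(commonPrefixLen([]rune("dr"), []rune("driveway"), 4))
}
```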
Hi.
I tried this library and compared it with https://github.com/adrg
In some cases the results differ:
package main

import (
	"fmt"

	"github.com/adrg/strutil/metrics"
	"github.com/antzucaro/matchr"
)

func main() {
	r2 := "wilson kjell"
	r1 := "wilson mathias"
	fmt.Printf("matchr long distance:%f\n", matchr.JaroWinkler(r1, r2, true))
	fmt.Printf("matchr short distance:%f\n", matchr.JaroWinkler(r1, r2, false))
	m := metrics.NewJaroWinkler()
	fmt.Printf("adrg:%f\n", m.Compare(r2, r1))
}

// matchr long distance:0.694444
// matchr short distance:0.694444
// adrg:0.816667
https://go.dev/play/p/z2IQsqYjIDQ
What is the correct distance between these strings?
The original implementation (strcmp95), called from Perl, gives us 0.83523.
Thank you.
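A textbook calculation may help narrow this down. The sketch below is an independent implementation, not code from either library: the function names and the threshold parameter are assumptions. Plain Jaro for these two strings is 0.694444 (7 matching characters, no transpositions). Adding the Winkler prefix bonus for the shared prefix "wils" gives 0.816667. Winkler's variant is commonly described as applying that bonus only when the Jaro score exceeds 0.7, so the two reported values are consistent with one library enforcing the threshold and the other not. It does not reproduce the strcmp95 figure, which involves further adjustments.

```go
package main

import "fmt"

// jaro computes the plain Jaro similarity of a and b.
func jaro(a, b string) float64 {
	r1, r2 := []rune(a), []rune(b)
	if len(r1) == 0 || len(r2) == 0 {
		return 0
	}
	longest := len(r1)
	if len(r2) > longest {
		longest = len(r2)
	}
	window := longest/2 - 1 // how far apart two matching runes may be
	if window < 0 {
		window = 0
	}
	m1, m2 := make([]bool, len(r1)), make([]bool, len(r2))
	matches := 0
	for i := range r1 {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(r2) {
			hi = len(r2)
		}
		for j := lo; j < hi; j++ {
			if !m2[j] && r1[i] == r2[j] {
				m1[i], m2[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Count matched runes that appear in a different order.
	k, outOfOrder := 0, 0
	for i := range r1 {
		if !m1[i] {
			continue
		}
		for !m2[k] {
			k++
		}
		if r1[i] != r2[k] {
			outOfOrder++
		}
		k++
	}
	m := float64(matches)
	t := float64(outOfOrder) / 2 // transpositions
	return (m/float64(len(r1)) + m/float64(len(r2)) + (m-t)/m) / 3
}

// winkler adds the prefix bonus (scaling 0.1, prefix capped at 4) only
// when the Jaro score exceeds threshold. The threshold parameter is a
// hypothetical knob for this comparison.
func winkler(a, b string, threshold float64) float64 {
	score := jaro(a, b)
	if score <= threshold {
		return score
	}
	r1, r2 := []rune(a), []rune(b)
	p := 0
	for p < 4 && p < len(r1) && p < len(r2) && r1[p] == r2[p] {
		p++
	}
	return score + float64(p)*0.1*(1-score)
}

func main() {
	a, b := "wilson mathias", "wilson kjell"
	fmt.Printf("jaro:                  %f\n", jaro(a, b))        // 0.694444
	fmt.Printf("winkler, 0.7 cutoff:   %f\n", winkler(a, b, 0.7)) // 0.694444
	fmt.Printf("winkler, no cutoff:    %f\n", winkler(a, b, -1))  // 0.816667
}
```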
Hi!
Could you create a release?