luozhouyang / python-string-similarity
A library implementing different string similarity and distance measures using Python.
License: MIT License
```python
from strsimpy.cosine import Cosine

def test_cosine4_passes_1():
    s0 = "0xxs"
    s1 = "foo bar"
    c_4 = Cosine(4)
    c_4.similarity(s0, s1)

def test_cosine4_fails_1():
    s0 = " "
    s1 = "foo bar"
    c_4 = Cosine(4)
    c_4.similarity(s0, s1)

def test_cosine4_fails_2():
    s0 = "0 s"
    s1 = "foo bar"
    c_4 = Cosine(4)
    c_4.similarity(s0, s1)
```
```
strsimpy version = 0.2.0
platform darwin -- Python 3.8.2, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
profile0 = {}
profile1 = {'foo ': 1, 'oo b': 1, 'o ba': 1, ' bar': 1}
norm_0 = 0.0
norm_1 = 2.0
FAILED test/test_cosine_sim.py::test_cosine4 - ZeroDivisionError: float division by zero
```
For Jaccard distance:

```
File "/users/home/docs/strdist.py", line 67, in <listcomp>
  distances = np.array([[jac.distance(w1, w2) for w1 in tokens] for w2 in tokens])
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/similarity/jaccard.py", line 32, in distance
  return 1.0 - self.similarity(s0, s1)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/similarity/jaccard.py", line 51, in similarity
  return 1.0 * inter / len(union)
ZeroDivisionError: float division by zero
```
And for Cosine distance:

```
File "/users/home/docs/strdist.py", line 75, in <listcomp>
  distances = np.array([[cos.distance(w1, w2) for w1 in tokens] for w2 in tokens])
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/similarity/cosine.py", line 35, in distance
  return 1.0 - self.similarity(s0, s1)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/similarity/cosine.py", line 49, in similarity
  self._norm(profile0) * self._norm(profile1))
ZeroDivisionError: float division by zero
```
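Until the library guards against empty profiles, one stdlib-only workaround is to compute the shingle profiles yourself and define similarity against an empty profile as 0. This is a sketch that mirrors what Cosine(4) builds internally, not strsimpy's own code:

```python
import math

def shingle_profile(s, k=4):
    """k-shingle frequency profile, mirroring what Cosine(k) builds internally."""
    profile = {}
    for i in range(len(s) - k + 1):
        sh = s[i:i + k]
        profile[sh] = profile.get(sh, 0) + 1
    return profile

def safe_cosine_similarity(s0, s1, k=4):
    """Cosine similarity over k-shingles that returns 0.0 instead of raising
    ZeroDivisionError when either string is too short to yield any shingle."""
    p0, p1 = shingle_profile(s0, k), shingle_profile(s1, k)
    norm0 = math.sqrt(sum(v * v for v in p0.values()))
    norm1 = math.sqrt(sum(v * v for v in p1.values()))
    if norm0 == 0.0 or norm1 == 0.0:
        return 0.0  # guard: an empty profile gets similarity 0 by definition
    dot = sum(count * p1.get(sh, 0) for sh, count in p0.items())
    return dot / (norm0 * norm1)

print(safe_cosine_similarity(" ", "foo bar"))  # 0.0 instead of a crash
```

Whether similarity against an empty profile should be 0 (or undefined) is a design choice; 0 at least keeps batch computations like the list comprehensions above from crashing.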
There are other algorithms available in TextDistance that may be worth considering.
Major Source of Info: https://github.com/life4/textdistance
Hey
Thank you for the package! It's really amazing!
I'm wondering if it's possible to use words instead of characters in Levenshtein and Damerau-Levenshtein methods?
The result of the code below is 10:

```python
from similarity.levenshtein import Levenshtein
print(Levenshtein().distance('Hello world', 'Hello brave new world'))
```
But if I use words instead of characters in the algorithm, I get 2 (insert 'brave', insert 'new').
Is it possible?
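One way to get a word-level result is to run the same dynamic program over word tokens instead of characters. A stdlib-only sketch (a hypothetical helper, not part of the library):

```python
def word_levenshtein(s0, s1):
    """Levenshtein distance over word tokens instead of characters."""
    a, b = s0.split(), s1.split()
    # prev[j] = edit distance between the words of a seen so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1]

print(word_levenshtein('Hello world', 'Hello brave new world'))  # 2
```

The same tokenize-then-compare idea extends to Damerau-Levenshtein by adding the transposition case to the recurrence.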
NGram distance returns wrong value for strings shorter than N.
```python
from strsimpy.ngram import NGram
ng = NGram()
ng.distance("abc", "abc") == 0.0
ng.distance("a", "b") == 0.0  # should be 1.0
```
The distance between a and b should be 1.0, as the strings are completely different. If N is two, then the code at https://github.com/luozhouyang/python-string-similarity/blob/master/strsimpy/ngram.py#L45 calculates a cost of 0 and returns 1.0 * cost / max(sl, tl). That expression is a similarity (which really is 0 here, because the strings share nothing), but the method is supposed to return a normalized distance, which in this case should be the maximum possible value.
This issue seems to be at https://github.com/luozhouyang/python-string-similarity/blob/master/strsimpy/ngram.py#L49, where it does

```python
return 1.0 * cost / max(sl, tl)
```

however, I think in this case it should return

```python
return 1.0 - cost / max(sl, tl)
```
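Under that proposed fix, the short-string branch could be sketched as follows. This is a stdlib-only illustration of the proposed behavior, not the library's code, and the empty-string handling is my own assumption:

```python
def ngram_distance_short(s0, s1, n=2):
    """Sketch of the proposed fix for strings shorter than n: count the
    positionally matching characters (the 'cost' in ngram.py) and return
    a normalized *distance*, not a similarity."""
    sl, tl = len(s0), len(s1)
    if sl == 0 or tl == 0:
        return 0.0 if sl == tl else 1.0  # assumed convention for empty input
    cost = sum(1 for i in range(min(sl, tl)) if s0[i] == s1[i])
    return 1.0 - cost / max(sl, tl)

print(ngram_distance_short("a", "b"))  # 1.0, the value expected above
print(ngram_distance_short("a", "a"))  # 0.0
```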
I'll create PR to address this.
Thank you
Hi there, I'm just getting started with string similarity processing.
In my application, I need to compare short-ish strings of length 25-300 characters, and I need the 'distance between any two' metric to reward things like:
Any suggestions, among the wealth of algorithms and modes supported in this package?
Cheers
David
Brilliant library, thank you. Do you by any chance have any suggested weights for the OCR and QWERTY scenarios, or know if and where some might be available? Thanks again.
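I'm not aware of published weight tables shipping with the library, but the weighted-Levenshtein idea for the QWERTY case can be sketched stdlib-only: make substitutions between adjacent keys cheaper than arbitrary ones. The neighbor map and the 0.5 weight below are hypothetical placeholders, not recommended values:

```python
# Tiny hypothetical excerpt of a QWERTY adjacency map, not a full keyboard model.
QWERTY_NEIGHBORS = {
    'q': 'wa', 'w': 'qes', 'e': 'wrd', 'r': 'etf', 't': 'ryg',
}

def sub_cost(c0, c1):
    """Substitution cost: free for equal chars, half price for neighbors."""
    if c0 == c1:
        return 0.0
    if c1 in QWERTY_NEIGHBORS.get(c0, '') or c0 in QWERTY_NEIGHBORS.get(c1, ''):
        return 0.5  # assumed weight for adjacent-key typos
    return 1.0

def weighted_levenshtein(s0, s1):
    """Levenshtein DP with the custom substitution cost above."""
    prev = [float(j) for j in range(len(s1) + 1)]
    for i, c0 in enumerate(s0, 1):
        curr = [float(i)] + [0.0] * len(s1)
        for j, c1 in enumerate(s1, 1):
            curr[j] = min(prev[j] + 1.0,      # deletion
                          curr[j - 1] + 1.0,  # insertion
                          prev[j - 1] + sub_cost(c0, c1))
        prev = curr
    return prev[-1]

print(weighted_levenshtein('wet', 'qet'))  # 0.5: 'w' and 'q' are neighbors
```

An OCR table would follow the same shape, with confusable glyph pairs ('0'/'O', 'l'/'1', 'rn'/'m', etc.) given low substitution costs.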
I am trying to work with your library (it looks promising and has great documentation), but I am confused by strsimpy coexisting on PyPI with strsim and pystrsim, which have different version numbers but the same code and documentation.
Would you mind deleting some of these projects from PyPI, or at least documenting the differences?
(Sorry for opening two separate issues, but I felt they addressed different problems)
JaroWinkler is slower than jellyfish's implementation. Also, the results are different.
```python
%%timeit
a = 'book egwrhgr rherh'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav book teee'
import jellyfish
jellyfish.jaro_winkler(a, b)
# 3.97 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# result: 0.35942760942760943
```
```python
%%timeit
a = 'book egwrhgr rherh'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav book teee'
from strsimpy.jaro_winkler import JaroWinkler
jarowinkler = JaroWinkler()
jarowinkler.distance(a, b)
# 69.8 µs ± 706 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# result: 0.6405723905723906
```

(Note that 0.3594… = 1 − 0.6406…: jellyfish's jaro_winkler returns a similarity while distance() returns a distance, so the two values actually agree; the real gap is speed.)
First of all, let me congratulate the dev for this amazing library.
I was wondering whether some function is implemented that finds the most similar word between a target word and a vocabulary. For example:

Target word: 'tsring'
Vocabulary: ['hello', 'world', 'string', 'foo', 'bar']
So maybe something like:
```python
jw = JaroWinkler()
jw.most_similar('tsring', ['hello', 'world', 'string', 'foo', 'bar'])
# [1] 'string'
```
I've tried the same construction with the distance and similarity methods, but although no error is thrown, the operation does not seem to be supported:

```python
jw.distance('tsring', ['hello', 'world', 'string', 'foo', 'bar'])
# [1] 1.0
jw.similarity('tsring', ['hello', 'world', 'string', 'foo', 'bar'])
# [2] 0.0
```
I know it's trivial to implement an independent function with this behavior on top of the distance or similarity functions, but I'm asking just in case a highly-optimized function is already implemented :)
Thanks in advance!
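Such a helper is easy to build on top of any similarity callable. The sketch below is stdlib-only, using a character-bigram Jaccard similarity as a hypothetical stand-in; in practice you could pass JaroWinkler().similarity (or any function taking two strings) instead:

```python
def bigram_similarity(s0, s1):
    """Jaccard similarity over character bigrams (stand-in metric)."""
    g0 = {s0[i:i + 2] for i in range(len(s0) - 1)}
    g1 = {s1[i:i + 2] for i in range(len(s1) - 1)}
    if not g0 and not g1:
        return 1.0  # two sub-bigram strings: treat as identical
    return len(g0 & g1) / len(g0 | g1)

def most_similar(target, vocabulary, similarity=bigram_similarity):
    """Hypothetical helper: vocabulary entry maximizing similarity to target."""
    return max(vocabulary, key=lambda w: similarity(target, w))

print(most_similar('tsring', ['hello', 'world', 'string', 'foo', 'bar']))  # 'string'
```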
I installed via pip and I'm getting ModuleNotFoundError: No module named 'similarity'. Here is my code below. I got this error on 2 machines, using Python 3.7.3 64-bit and Python 3.5 32-bit.

```python
from similarity.levenshtein import Levenshtein

levenshtein = Levenshtein()
print(levenshtein.distance('My string', 'My $string'))
```
I got ZeroDivisionError: float division by zero when running the QGram algorithms:

```
File "C:\Users\Tensorflow\Anaconda3\lib\site-packages\similarity\sorensen_dice.py", line 51, in similarity
  return 2.0 * inter / (len(profile0) + len(profile1))
ZeroDivisionError: float division by zero
```
Fixed it with this code:

```python
# delete short words
content_noShortWords = []
for e in content:
    if len(e) > 2:
        content_noShortWords.append(e)
```
Hope this can help someone get past the same problem.
Hi,
The example of the LCS function you give in the README is wrong. The LCS of 'AGCAT' and 'GAC' is not 4, it is 2; please see the Wikipedia page for LCS, where this example is worked out. (If the README's 4 refers to the LCS distance |s0| + |s1| − 2·|LCS| = 5 + 3 − 4, the two quantities are easy to conflate.)
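For reference, the two quantities can be computed side by side. This is a standard textbook DP sketch, and lcs_distance uses the |s0| + |s1| − 2·|LCS| formula on which LCS-based distances are commonly defined:

```python
def lcs_length(s0, s1):
    """Classic DP for the length of the longest common subsequence."""
    prev = [0] * (len(s1) + 1)
    for c0 in s0:
        curr = [0] * (len(s1) + 1)
        for j, c1 in enumerate(s1, 1):
            # extend the LCS on a match, otherwise carry the best so far
            curr[j] = prev[j - 1] + 1 if c0 == c1 else max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1]

def lcs_distance(s0, s1):
    """Distance derived from LCS length: |s0| + |s1| - 2 * |LCS|."""
    return len(s0) + len(s1) - 2 * lcs_length(s0, s1)

print(lcs_length('AGCAT', 'GAC'))    # 2
print(lcs_distance('AGCAT', 'GAC'))  # 4
```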
Python3.x implementation of tdebatty/java-string-similarity
I believe you meant to say Java library
Thank you for your great package.
I compared this package's speed with other CPython packages and it is slower.
Is it possible to improve the speed?
```python
a = 'fsffvfdsbbdfvvdavavavavavava'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav'

# levenshtein
%%timeit
import editdistance
editdistance.eval(a, b)
# 2.12 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
from strsimpy.levenshtein import Levenshtein
levenshtein = Levenshtein()
levenshtein.distance(a, b)
# 528 µs ± 990 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Hi! I've installed the latest version from pip, and get a ModuleNotFoundError:

```
File "/home/keddad/Documents/thevyshka-news-fetcher/cacher.py", line 6, in <module>
  from strsimpy.ngram import NGram
File "/home/keddad/.local/share/virtualenvs/thevyshka-news-fetcher-8gFlsF9b/lib/python3.8/site-packages/strsimpy/__init__.py", line 30, in <module>
  from .optimal_string_alignment import OptimalStringAlignment
File "/home/keddad/.local/share/virtualenvs/thevyshka-news-fetcher-8gFlsF9b/lib/python3.8/site-packages/strsimpy/optimal_string_alignment.py", line 21, in <module>
  import numpy as np
ModuleNotFoundError: No module named 'numpy'
```
There already was #12 about this error, and it was said that latest version doesn't need numpy, but, apparently, it does:
https://github.com/luozhouyang/python-string-similarity/blob/6b8fbd68a535bf92f849bc624d58fd99ef8f46b1/strsimpy/optimal_string_alignment.py
This is a broader question - but do you have any insight on how to convert e.g. Levenshtein distance to probability? I want to combine the edit distance with prior information on the background population - and it's not clear to me how to combine the two metrics. Thanks!
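There is no canonical mapping, but one common heuristic is a softmax over negative edit distances, which yields a distribution that a background prior can then be folded into (multiply each candidate's probability by its prior and renormalize, Bayes-style). A stdlib-only sketch; the scale parameter and the inlined Levenshtein are my own assumptions, not library features:

```python
import math

def distance_to_probability(target, candidates, scale=1.0):
    """Softmax over negative Levenshtein distances: a heuristic for turning
    edit distances into a probability distribution over candidates.
    `scale` controls how sharply mass concentrates on close matches."""
    def levenshtein(a, b):
        # standard two-row Levenshtein DP
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i] + [0] * len(b)
            for j, cb in enumerate(b, 1):
                curr[j] = min(prev[j] + 1, curr[j - 1] + 1,
                              prev[j - 1] + (ca != cb))
            prev = curr
        return prev[-1]

    weights = [math.exp(-scale * levenshtein(target, c)) for c in candidates]
    total = sum(weights)
    return [w / total for w in weights]

print(distance_to_probability('cat', ['cat', 'bat', 'dog']))
```

Note the result is not a calibrated probability; calibrating it against real data (e.g. fitting `scale` on labeled matches) is a separate modeling step.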
When running from strsimpy.levenshtein import Levenshtein, I get the error ModuleNotFoundError: No module named 'numpy', which is correct, as I haven't installed numpy on this machine.
I could write a pull request for your setup.py to state the dependencies, but maybe you are aware of more dependencies that should be declared?
(Sorry for opening two separate issues, but I felt they addressed different problems)