luozhouyang / python-string-similarity
973 stars · 25 watchers · 126 forks · 175 KB

A library implementing different string similarity and distance measures using Python.

License: MIT License

Languages: Python 100.00%
Topics: python, similarity, string, algorithm, distance-measure

python-string-similarity's Issues

Cosine has a divide-by-zero bug

from strsimpy.cosine import Cosine

def test_cosine4_passes_1():
    s0 = "0xxs"
    s1 = "foo bar"

    c_4 = Cosine(4)
    c_4.similarity(s0, s1)

def test_cosine4_fails_1():
    s0 = " "
    s1 = "foo bar"

    c_4 = Cosine(4)
    c_4.similarity(s0, s1)

def test_cosine4_fails_2():
    s0 = "0 s"
    s1 = "foo bar"

    c_4 = Cosine(4)
    c_4.similarity(s0, s1)

strsimpy version = 0.2.0
platform darwin -- Python 3.8.2, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
profile0 = {}
profile1 = {'foo ': 1, 'oo b': 1, 'o ba': 1, ' bar': 1}
norm_0 = 0.0
norm_1 = 2.0
FAILED test/test_cosine_sim.py::test_cosine4 - ZeroDivisionError: float division by zero
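The debug output shows the failure mode: Cosine(4) builds 4-character shingle profiles, and a string that yields no 4-character shingle after whitespace normalization produces an empty profile and a zero norm. A minimal standalone sketch of the computation with a guard added (the function names are hypothetical, not part of strsimpy's API):

```python
import re

def shingle_profile(s, k):
    # approximates the library's preprocessing: collapse runs of
    # whitespace, then count the k-character shingles
    s = re.sub(r"\s+", " ", s)
    profile = {}
    for i in range(len(s) - k + 1):
        sh = s[i:i + k]
        profile[sh] = profile.get(sh, 0) + 1
    return profile

def safe_cosine_similarity(s0, s1, k=4):
    p0, p1 = shingle_profile(s0, k), shingle_profile(s1, k)
    if not p0 or not p1:
        # an empty profile means a zero norm; guard instead of dividing
        return 1.0 if s0 == s1 else 0.0
    dot = sum(c * p1.get(sh, 0) for sh, c in p0.items())
    norm0 = sum(c * c for c in p0.values()) ** 0.5
    norm1 = sum(c * c for c in p1.values()) ** 0.5
    return dot / (norm0 * norm1)

print(safe_cosine_similarity(" ", "foo bar"))  # 0.0 instead of ZeroDivisionError
```

Returning 0.0 for an empty-vs-non-empty profile pair is one defensible convention; raising a ValueError for strings shorter than k would be another.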

ZeroDivisionError when computing Jaccard or Cosine distances

For Jaccard distance

  File "/users/home/docs/strdist.py", line 67, in <listcomp>
    distances = np.array([[jac.distance(w1, w2) for w1 in tokens] for w2 in tokens])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/similarity/jaccard.py", line 32, in distance
    return 1.0 - self.similarity(s0, s1)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/similarity/jaccard.py", line 51, in similarity
    return 1.0 * inter / len(union)
ZeroDivisionError: float division by zero

And for Cosine distance

  File "/users/home/docs/strdist.py", line 75, in <listcomp>
    distances = np.array([[cos.distance(w1, w2) for w1 in tokens] for w2 in tokens])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/similarity/cosine.py", line 35, in distance
    return 1.0 - self.similarity(s0, s1)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/similarity/cosine.py", line 49, in similarity
    self._norm(profile0) * self._norm(profile1))
ZeroDivisionError: float division by zero

Alternate Algorithms from TextDistance

Words instead of characters

Hey

Thank you for the package! It's really amazing!

I'm wondering if it's possible to use words instead of characters in Levenshtein and Damerau-Levenshtein methods?

The result of the code below is 10:

from similarity.levenshtein import Levenshtein
print(Levenshtein().distance('Hello world', 'Hello brave new world'))

But if I use words instead of characters in the algorithm, I get 2 (insert 'brave', insert 'new').

Is it possible?
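strsimpy does not expose a token mode, but the same dynamic program runs unchanged over word lists. A minimal sketch, independent of the library:

```python
# Word-level Levenshtein: split into tokens first, then run the usual
# two-row DP over token lists instead of character strings.
def word_levenshtein(s0, s1):
    a, b = s0.split(), s1.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            cur.append(min(prev[j] + 1,          # delete a word
                           cur[j - 1] + 1,       # insert a word
                           prev[j - 1] + cost))  # substitute a word
        prev = cur
    return prev[-1]

print(word_levenshtein('Hello world', 'Hello brave new world'))  # 2
```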

NGram issue with strings shorter than N

NGram distance returns wrong value for strings shorter than N.

from strsimpy.ngram import NGram
ng = NGram()
ng.distance("abc", "abc") == 0.0
ng.distance("a", "b") == 0.0 # should be 1.0

The distance between "a" and "b" should be 1.0, as the strings are completely different. With N = 2, the code at https://github.com/luozhouyang/python-string-similarity/blob/master/strsimpy/ngram.py#L45 calculates a cost of 0 and returns 1.0 * cost / max(sl, tl). That expression is a similarity (which is indeed 0 here, because the strings share nothing), but the method is supposed to return a normalized distance, which in this case should be the maximum possible value.

This issue seems to be at https://github.com/luozhouyang/python-string-similarity/blob/master/strsimpy/ngram.py#L49 where it does

return 1.0 * cost / max(sl, tl)

however I think in this case it should return

return 1.0 - cost / max(sl, tl)

I'll create a PR to address this.

Thank you

Needing Advice: Best algo(s) for distance based on "proportion of shared substrings"

Hi there, I'm just getting started with string similarity processing.
In my application, I need to compare short-ish strings of length 25-300 characters, and I need the 'distance between any two' metric to reward things like:

  1. Proportion of each string which is shared substrings, and
  2. Sizes of shared substrings, especially relative to the lengths of the strings being compared

Any suggestions, among the wealth of algorithms and modes supported in this package?

Cheers
David
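One simple baseline, sketched with the standard library rather than strsimpy: difflib.SequenceMatcher's ratio is 2*M / (len(a) + len(b)), where M counts the characters inside shared matching blocks, so it rewards both the proportion and the contiguity of shared text:

```python
from difflib import SequenceMatcher

def shared_substring_score(a, b):
    # 1.0 = identical, 0.0 = no shared blocks at all
    return SequenceMatcher(None, a, b).ratio()

print(shared_substring_score("the quick brown fox", "a quick brown dog"))
```

Note that strsimpy's LCS-based measures work on common subsequences, which need not be contiguous; if contiguous substrings are what matter, block-matching in the SequenceMatcher style may fit the stated criteria better.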

Example / Default / Suggested Weights

Brilliant library, thank you! Do you by any chance have suggested weights for the OCR and QWERTY scenarios, or know if and where some might be available? Thanks again.

strsim or strsimpy or pystrsim?

I am trying to work with your library (it looks promising and has great documentation), but I am confused by strsimpy coexisting on PyPI with strsim and pystrsim, which have different version numbers but the same code and documentation.

Would you mind deleting some of these projects from PyPI or at least documenting the differences?

(Sorry for opening two separate issues, but I felt they addressed different problems)

speed of JaroWinkler

JaroWinkler is slower than jellyfish's implementation. Also, the results are different.

%%timeit
a = 'book egwrhgr rherh'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav book teee'

import jellyfish
jellyfish.jaro_winkler(a,b) 
# 3.97 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# result > 0.35942760942760943

%%timeit
a = 'book egwrhgr rherh'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav book teee'
from strsimpy.jaro_winkler import JaroWinkler
jarowinkler = JaroWinkler()
jarowinkler.distance(a,b)
# 69.8 µs ± 706 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# result > 0.6405723905723906
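Part of the discrepancy in the results is an apples-to-oranges comparison: jellyfish.jaro_winkler returns a similarity, while the strsimpy call above is .distance, which is its complement. The two reported figures sum to 1.0:

```python
# the two numbers reported above are complements of each other
jelly = 0.35942760942760943  # jellyfish.jaro_winkler(a, b): a similarity
strs = 0.6405723905723906    # JaroWinkler().distance(a, b): a distance
print(jelly + strs)          # ~1.0
```

Comparing against JaroWinkler().similarity(a, b) should give matching values; the speed gap, however, is real (pure Python versus a C extension).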

Find most similar word among target and list of words

First of all, let me congratulate the dev for this amazing library.

I was wondering whether there is a built-in function to find the most similar word to a target word within a vocabulary. For example:

Target word: tsring
Vocabulary: ['hello', 'world', 'string', 'foo', 'bar']

So maybe something like:

jw = JaroWinkler()
jw.most_similar('tsring', ['hello', 'world', 'string', 'foo', 'bar'])
[1] 'string'

I've tried passing a list to the distance and similarity methods; although no error is thrown, the operation is not actually supported:

jw.distance('tsring', ['hello', 'world', 'string', 'foo', 'bar'])
[1] 1.0
jw.similarity('tsring', ['hello', 'world', 'string', 'foo', 'bar'])
[2] 0.0

I know it's trivial to implement an independent function with this behavior based on the distance or similarity functions. But just in case a highly-optimized function is already implemented :)

Thanks in advance!
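No such helper appears to be built in; the list argument likely gets treated as a generic sequence whose whole-word elements never match individual characters, which would explain the 1.0 / 0.0 results above. A thin wrapper is enough; sketched here with difflib so it is self-contained (most_similar is a hypothetical name, and any strsimpy .similarity method could serve as the key):

```python
from difflib import SequenceMatcher

def most_similar(target, vocabulary):
    # pick the vocabulary entry with the highest similarity to the target
    return max(vocabulary, key=lambda w: SequenceMatcher(None, target, w).ratio())

print(most_similar('tsring', ['hello', 'world', 'string', 'foo', 'bar']))  # 'string'
```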

installation problem

I installed via pip and I'm getting ModuleNotFoundError: No module named 'similarity'. Here is my code below; I got this error on two machines, one running Python 3.7.3 64-bit and one Python 3.5 32-bit.

from similarity.levenshtein import Levenshtein

levenshtein = Levenshtein()
print(levenshtein.distance('My string', 'My $string'))
print(levenshtein.distance('My string', 'My $string'))
print(levenshtein.distance('My string', 'My $string'))

ZeroDivisionError: float division by zero

I got the error ZeroDivisionError: float division by zero when running the QGram algorithms:

  File "C:\Users\Tensorflow\Anaconda3\lib\site-packages\similarity\sorensen_dice.py", line 51, in similarity
    return 2.0 * inter / (len(profile0) + len(profile1))
ZeroDivisionError: float division by zero

I fixed it with the following code:

# delete short words
content_noShortWords = []
for e in content:
    if len(e)>2:
        content_noShortWords.append(e)

(see here for explanations)
Hope this helps someone facing the same problem.
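Filtering short tokens works; an alternative is to guard the similarity itself. Below is a standalone re-implementation of the Sørensen–Dice formula from the traceback with the zero-profile cases handled explicitly (illustrative only, not the library's code; k=3 matches the default shingle size only by assumption):

```python
def sorensen_dice(s0, s1, k=3):
    # build sets of distinct k-shingles; too-short strings yield empty sets
    p0 = {s0[i:i + k] for i in range(len(s0) - k + 1)}
    p1 = {s1[i:i + k] for i in range(len(s1) - k + 1)}
    if not p0 and not p1:
        return 1.0 if s0 == s1 else 0.0
    if not p0 or not p1:
        return 0.0
    inter = len(p0 & p1)
    return 2.0 * inter / (len(p0) + len(p1))

print(sorensen_dice('ab', 'night'))  # 0.0 instead of ZeroDivisionError
```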

Bug in LCS Function

Hi,

The example of the LCS function you give in the README is wrong: the LCS of 'AGCAT' and 'GAC' is 2, not 4. Please see the Wikipedia page for LCS, where this example is worked out.

readme, incorrect URL

The README says: "Python3.x implementation of tdebatty/java-string-similarity"

I believe you meant to say "Java library".

Speed of levenshtein

Thank you for your great package.
I compared this package's speed with other CPython packages and it is slower.
Is it possible to improve the speed?

a = 'fsffvfdsbbdfvvdavavavavavava'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav'
# levenshtein
%%timeit
import editdistance
editdistance.eval(a, b)
# 2.12 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
from strsimpy.levenshtein import Levenshtein
Levenshtein = Levenshtein()
Levenshtein.distance(a,b)
# 528 µs ± 990 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
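The gap above is mostly pure Python versus a C extension, so an algorithmic tweak in Python narrows it but cannot close it. A sketch of the usual constant-factor tricks (a two-row DP table plus common prefix/suffix trimming):

```python
def levenshtein_two_row(a, b):
    # strip a shared prefix and suffix: they never contribute edits
    while a and b and a[0] == b[0]:
        a, b = a[1:], b[1:]
    while a and b and a[-1] == b[-1]:
        a, b = a[:-1], b[:-1]
    # classic DP, keeping only the previous and current rows
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

print(levenshtein_two_row('fsffvfdsbbdfvvdavavavavavava',
                          'fvdaabavvvvvadvdvavavadfsfsdafvvav'))
```

For editdistance-level timings, delegating to a C-backed library (editdistance, python-Levenshtein, RapidFuzz) is the realistic route.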

Undeclared dependency: Numpy

Hi! I've installed the latest version from pip and get a ModuleNotFoundError:

  File "/home/keddad/Documents/thevyshka-news-fetcher/cacher.py", line 6, in <module>
    from strsimpy.ngram import NGram
  File "/home/keddad/.local/share/virtualenvs/thevyshka-news-fetcher-8gFlsF9b/lib/python3.8/site-packages/strsimpy/__init__.py", line 30, in <module>
    from .optimal_string_alignment import OptimalStringAlignment
  File "/home/keddad/.local/share/virtualenvs/thevyshka-news-fetcher-8gFlsF9b/lib/python3.8/site-packages/strsimpy/optimal_string_alignment.py", line 21, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'

There was already #12 about this error, and it was said that the latest version doesn't need numpy, but apparently it does:
https://github.com/luozhouyang/python-string-similarity/blob/6b8fbd68a535bf92f849bc624d58fd99ef8f46b1/strsimpy/optimal_string_alignment.py

Convert distance to probability

This is a broader question - but do you have any insight on how to convert e.g. Levenshtein distance to probability? I want to combine the edit distance with prior information on the background population - and it's not clear to me how to combine the two metrics. Thanks!
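Nothing in the library does this; one common heuristic is a noisy-channel view: treat exp(-distance / temperature) as an unnormalized likelihood and fold in the background population as a prior via Bayes' rule. The temperature and the priors below are illustrative assumptions, not calibrated values:

```python
import math

def posterior(distances, priors, temperature=1.0):
    # distances, priors: dicts keyed by candidate word;
    # score = likelihood exp(-d/T) times the prior, then normalize
    scores = {w: math.exp(-d / temperature) * priors[w]
              for w, d in distances.items()}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

post = posterior({'string': 2, 'spring': 3}, {'string': 0.7, 'spring': 0.3})
print(post)
```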

Undeclared dependency: numpy

When running from strsimpy.levenshtein import Levenshtein I get the error ModuleNotFoundError: No module named 'numpy', which is correct since I haven't installed numpy on this machine.

I could write a pull request for your setup.py to state the dependencies, but maybe you are aware of more dependencies that should be declared?

(Sorry for opening two separate issues, but I felt they addressed different problems)
