fujimotos / polyleven Goto Github PK

View Code? Open in Web Editor NEW

79.0 4.0 10.0 523 KB

Fast Levenshtein Distance Library for Python 3

Home Page: https://ceptord.net

License: MIT License

C 53.55% Python 43.98% Makefile 2.47%

levenshtein-distance

polyleven's People

Contributors

Stargazers

Watchers

Forkers

pombredanne zktuong rg089 kdent nick-mazuk iamdank miweiss datacycling matt-tony kimdwkimdw

polyleven's Issues

Optimize Myers1999 for large dictionary filtering

This is a sketch of an improvement idea I'm thinking to incorporate into polyleven.

How is polyleven used in real life?

The typical usecase of polyleven is "finding a needle in a haystack"-type tasks.
e.g. "I want to find out similar DNA sequences to this input in the 1 billion database"
So the task is essentially a filtering/search problem.

What is the optimization idea?

As the original paper described, Myers1999 requires a pre-computation of a bit-pattern table.
This precomputation is actually very expensive -- I think it takes at least 30% of the total computation budget.
The viable optimization path here is to cache the bit-pattern table.

What does the new interface look like?

I think the a map function fits very well here:

>>> polyleven.map("abcde", ["abc", "acde", ...])

How much improvement will wel see?

The best case is that we'll see 30-40% speed-up with this optimization.
IOW Polyleven will be able to process ~7 million entries per sec on a consumer-grade CPUs (such as AMD Ryzen 7).

Can we vectorlize myers1999?

This is a sketch of an idea on how to speed up Myers1999 further.

Observation

Myers1999 consists of a series of bitwise operations on 64-bit integers.
Intuitively, this feels very much like a "vectorizable" function.
a.k.a. Modern CPUs can perform bitwise operations on several integers in a single instruction.

Question

Can we use, say, AVX512 to improve the performance of Myers1999?
If indeed we can, this optimization will speed up the computation time as much as 4 times.

References

https://blog.cr.yp.to/20190430-vectorize.html

wheels for polyleven

First of all thanks for making polyleven available - the performance of polyleven is unbeatable !

Would it be possible for you to add support for wheels ?

I am using this library to build a docker image and lack of wheels is increasing the size of the image.

Deprecation Warning: Legacy 'setup.py install' Method

Hi,

I'm getting following deprecation warning during the installation. It looks like there will be a breaking change in pip 23.1:

Collecting polyleven
  Downloading polyleven-0.8.tar.gz (6.4 kB)
  Preparing metadata (setup.py) ... done
Installing collected packages: polyleven
  DEPRECATION: polyleven is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
  Running setup.py install for polyleven ... done
Successfully installed polyleven-0.8

[notice] A new release of pip available: 22.3.1 -> 23.2.1
[notice] To update, run: pip install --upgrade pip

Best!

Python 3.10 Support

While I was able to install polyleven on python 3.6 - 3.9, installation did not succeed for python 3.10.

Thanks for providing polyleven - it's great!

The opposite value was calculated using a Levenshtein bound to distance 0.

This can be seen in the following line:

    if (k == 0)
        return PyUnicode_Compare(o1, o2) ? 1 : 0;

The PyUnicode_Compare documentation states:

int PyUnicode_Compare(PyObject *left, PyObject *right)
Compare two strings and return -1, 0, 1 for less than, equal, and greater than, respectively.

This function returns -1 upon failure, so one should call PyErr_Occurred() to check for errors.

The fix should be:

    if (k == 0)
        return PyUnicode_Compare(o1, o2) ? 0 : 1;

Potential performance improvement

I believe we can speed up polyleven when it uses the mbleven algorithm quite a bit. In my testing on a Rust version I created, it can lead to a 6x performance improvement in some of the slower cases by allowing for early short-circuits. It all comes down to this while loop inside the mbleven_ascii function.

 while (MBLEVEN_MATRIX[pos]) {
        m = MBLEVEN_MATRIX[pos++];
        i = j = c = 0;
        while (i < len1 && j < len2) {
            if (s1[i] != s2[j]) {
                c++;
                if (!m) break;
                if (m & 1) i++;
                if (m & 2) j++;
                m >>= 2;
            } else {
                i++;
                j++;
            }
        }
        c += (len1 - i) + (len2 - j);
        r = MIN(r, c);
        // add improvements here
    }

At the end of the outer while loop, we can short circuit if:

r == 0: This means both words are identical. We can short circuit and return immediately. It's impossible for the edit distance to be negative, so once r == 0, we know we've reached the best case. This will always occur on the first iteration of the loop, meaning we prevent many excess iterations of the loop (e.g., if k = 3 and d = 0, where d is the difference in the length of the words, we save 6 unneeded iterations).
r == 1: If this happens, we know for a fact that the words are not identical. If they were, we'd get r == 0. Since r == 1 is the smallest number of possible edits when the words are not equal, we can short-circuit early and return 1 immediately.

Hence, you can just add

if (r < 2) {
    return r;
}

This should apply to both the ASCII and strbuf cases.

Question: How to apply the function for 100k datapoints?

Is there a way to pass pandas data frame or python list?

fujimotos / polyleven Goto Github PK

polyleven's People

Contributors

Stargazers

Watchers

Forkers

polyleven's Issues

Optimize Myers1999 for large dictionary filtering

Can we vectorlize myers1999?

wheels for polyleven

Deprecation Warning: Legacy 'setup.py install' Method

Python 3.10 Support

The opposite value was calculated using a Levenshtein bound to distance 0.

Potential performance improvement

Question: How to apply the function for 100k datapoints?

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent