Coder Social home page Coder Social logo

polyleven's People

Contributors

fujimotos avatar maxbachmann avatar miweiss avatar nick-mazuk avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

polyleven's Issues

Optimize Myers1999 for large dictionary filtering

This is a sketch of an improvement idea I'm thinking to incorporate into polyleven.

How is polyleven used in real life?

  • The typical usecase of polyleven is "finding a needle in a haystack"-type tasks.
  • e.g. "I want to find out similar DNA sequences to this input in the 1 billion database"
  • So the task is essentially a filtering/search problem.

What is the optimization idea?

  • As the original paper described, Myers1999 requires a pre-computation of a bit-pattern table.
  • This precomputation is actually very expensive -- I think it takes at least 30% of the total computation budget.
  • The viable optimization path here is to cache the bit-pattern table.

What does the new interface look like?

  • I think the a map function fits very well here:
    >>> polyleven.map("abcde", ["abc", "acde", ...])

How much improvement will wel see?

  • The best case is that we'll see 30-40% speed-up with this optimization.
  • IOW Polyleven will be able to process ~7 million entries per sec on a consumer-grade CPUs (such as AMD Ryzen 7).

Can we vectorlize myers1999?

This is a sketch of an idea on how to speed up Myers1999 further.

Observation

  • Myers1999 consists of a series of bitwise operations on 64-bit integers.
  • Intuitively, this feels very much like a "vectorizable" function.
  • a.k.a. Modern CPUs can perform bitwise operations on several integers in a single instruction.

Question

  • Can we use, say, AVX512 to improve the performance of Myers1999?
  • If indeed we can, this optimization will speed up the computation time as much as 4 times.

References

wheels for polyleven

First of all thanks for making polyleven available - the performance of polyleven is unbeatable !

Would it be possible for you to add support for wheels ?

I am using this library to build a docker image and lack of wheels is increasing the size of the image.

Deprecation Warning: Legacy 'setup.py install' Method

Hi,

I'm getting following deprecation warning during the installation. It looks like there will be a breaking change in pip 23.1:

Collecting polyleven
  Downloading polyleven-0.8.tar.gz (6.4 kB)
  Preparing metadata (setup.py) ... done
Installing collected packages: polyleven
  DEPRECATION: polyleven is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
  Running setup.py install for polyleven ... done
Successfully installed polyleven-0.8

[notice] A new release of pip available: 22.3.1 -> 23.2.1
[notice] To update, run: pip install --upgrade pip

Best!

Python 3.10 Support

While I was able to install polyleven on python 3.6 - 3.9, installation did not succeed for python 3.10.

Thanks for providing polyleven - it's great!

The opposite value was calculated using a Levenshtein bound to distance 0.

This can be seen in the following line:

    if (k == 0)
        return PyUnicode_Compare(o1, o2) ? 1 : 0;

The PyUnicode_Compare documentation states:

int PyUnicode_Compare(PyObject *left, PyObject *right)
Compare two strings and return -1, 0, 1 for less than, equal, and greater than, respectively.

This function returns -1 upon failure, so one should call PyErr_Occurred() to check for errors.

The fix should be:

    if (k == 0)
        return PyUnicode_Compare(o1, o2) ? 0 : 1;

Potential performance improvement

I believe we can speed up polyleven when it uses the mbleven algorithm quite a bit. In my testing on a Rust version I created, it can lead to a 6x performance improvement in some of the slower cases by allowing for early short-circuits. It all comes down to this while loop inside the mbleven_ascii function.

 while (MBLEVEN_MATRIX[pos]) {
        m = MBLEVEN_MATRIX[pos++];
        i = j = c = 0;
        while (i < len1 && j < len2) {
            if (s1[i] != s2[j]) {
                c++;
                if (!m) break;
                if (m & 1) i++;
                if (m & 2) j++;
                m >>= 2;
            } else {
                i++;
                j++;
            }
        }
        c += (len1 - i) + (len2 - j);
        r = MIN(r, c);
        // add improvements here
    }

At the end of the outer while loop, we can short circuit if:

  • r == 0: This means both words are identical. We can short circuit and return immediately. It's impossible for the edit distance to be negative, so once r == 0, we know we've reached the best case. This will always occur on the first iteration of the loop, meaning we prevent many excess iterations of the loop (e.g., if k = 3 and d = 0, where d is the difference in the length of the words, we save 6 unneeded iterations).
  • r == 1: If this happens, we know for a fact that the words are not identical. If they were, we'd get r == 0. Since r == 1 is the smallest number of possible edits when the words are not equal, we can short-circuit early and return 1 immediately.

Hence, you can just add

if (r < 2) {
    return r;
}

This should apply to both the ASCII and strbuf cases.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.