fujimotos / polyleven Goto Github PK
View Code? Open in Web Editor NEWFast Levenshtein Distance Library for Python 3
Home Page: https://ceptord.net
License: MIT License
Fast Levenshtein Distance Library for Python 3
Home Page: https://ceptord.net
License: MIT License
This is a sketch of an improvement idea I'm thinking to incorporate into polyleven.
How is polyleven used in real life?
What is the optimization idea?
What does the new interface look like?
map
function fits very well here:
>>> polyleven.map("abcde", ["abc", "acde", ...])
How much improvement will wel see?
This is a sketch of an idea on how to speed up Myers1999 further.
Observation
Question
References
First of all thanks for making polyleven available - the performance of polyleven is unbeatable !
Would it be possible for you to add support for wheels ?
I am using this library to build a docker image and lack of wheels is increasing the size of the image.
Hi,
I'm getting following deprecation warning during the installation. It looks like there will be a breaking change in pip 23.1:
Collecting polyleven
Downloading polyleven-0.8.tar.gz (6.4 kB)
Preparing metadata (setup.py) ... done
Installing collected packages: polyleven
DEPRECATION: polyleven is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
Running setup.py install for polyleven ... done
Successfully installed polyleven-0.8
[notice] A new release of pip available: 22.3.1 -> 23.2.1
[notice] To update, run: pip install --upgrade pip
Best!
While I was able to install polyleven on python 3.6 - 3.9, installation did not succeed for python 3.10.
Thanks for providing polyleven - it's great!
This can be seen in the following line:
if (k == 0)
return PyUnicode_Compare(o1, o2) ? 1 : 0;
The PyUnicode_Compare documentation states:
int PyUnicode_Compare(PyObject *left, PyObject *right)
Compare two strings and return -1, 0, 1 for less than, equal, and greater than, respectively.
This function returns -1 upon failure, so one should call PyErr_Occurred() to check for errors.
The fix should be:
if (k == 0)
return PyUnicode_Compare(o1, o2) ? 0 : 1;
I believe we can speed up polyleven when it uses the mbleven algorithm quite a bit. In my testing on a Rust version I created, it can lead to a 6x performance improvement in some of the slower cases by allowing for early short-circuits. It all comes down to this while
loop inside the mbleven_ascii
function.
while (MBLEVEN_MATRIX[pos]) {
m = MBLEVEN_MATRIX[pos++];
i = j = c = 0;
while (i < len1 && j < len2) {
if (s1[i] != s2[j]) {
c++;
if (!m) break;
if (m & 1) i++;
if (m & 2) j++;
m >>= 2;
} else {
i++;
j++;
}
}
c += (len1 - i) + (len2 - j);
r = MIN(r, c);
// add improvements here
}
At the end of the outer while
loop, we can short circuit if:
r == 0
: This means both words are identical. We can short circuit and return immediately. It's impossible for the edit distance to be negative, so once r == 0
, we know we've reached the best case. This will always occur on the first iteration of the loop, meaning we prevent many excess iterations of the loop (e.g., if k = 3
and d = 0
, where d is the difference in the length of the words, we save 6 unneeded iterations).r == 1
: If this happens, we know for a fact that the words are not identical. If they were, we'd get r == 0
. Since r == 1
is the smallest number of possible edits when the words are not equal, we can short-circuit early and return 1
immediately.Hence, you can just add
if (r < 2) {
return r;
}
This should apply to both the ASCII and strbuf
cases.
Is there a way to pass pandas data frame or python list?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.