sp1ff / damerau-levenshtein

Comparison of a few algorithms for computing Damerau–Levenshtein distance

License: GNU General Public License v3.0

Languages: Makefile 2.70%, Shell 60.59%, M4 1.31%, C++ 35.40%
Topics: damerau-levenshtein, damerau-levenshtein-distance, cpp, string-matching, string-distance

damerau-levenshtein's Introduction

Comparing Damerau-Levenshtein Implementations

Introduction

This repository contains three implementations of Damerau-Levenshtein string distance computation in C++17.

The Program

This package builds a program imaginatively named dl. It is installed via the usual autotools incantations. dl will compute Damerau-Levenshtein distance over a corpus of input strings provided on the command line with one of three algorithms (on which more below). For performance benchmarking, you can specify that the program should run over the corpus repeatedly and/or print timing information on completion. Say dl --help for full usage.

Discussion

The Damerau-Levenshtein distance between two strings A & B is the minimal number of insertions, deletions, single-character substitutions & transpositions of adjacent characters needed to transform A into B (e.g. “act” -> “cat” (one transposition) -> “cart” (one insertion), so the D-L distance between “act” & “cart” is two). In his original paper [1], Damerau claimed that 80% of the errors in the system which gave rise to his work could be accounted for by one of these four error types.

While it’s a popular algorithm for fuzzy searching, there doesn’t seem to be a “go-to” implementation out there in C/C++ (it is a feature request for the Boost string library). There are several implementations on GitHub, but the ones I examined lacked both documentation of the algorithms used and test suites.

This repo contains implementations, in C++, of what I understand to be the three major algorithms for computing Damerau-Levenshtein distance.

Lowrance & Wagner’s Recursive Relation

This is the algorithm laid out in Wikipedia (at the time of this writing). It is due to Lowrance & Wagner [2] and seems to be the most widely known. In their paper, they prove that under a few (reasonable) conditions, a simple |A| x |B| recurrence relation is sufficient to compute the D-L distance between strings A & B. Their algorithm runs in O(|A|*|B|) time & space.
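
By way of illustration, here is a minimal sketch of that recurrence (the function name dl_lw and the std::string interface are my own illustrative choices, not this package's API):

    #include <algorithm>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // A sketch of the Lowrance-Wagner recurrence [2]; illustrative only.
    std::size_t dl_lw(const std::string& A, const std::string& B) {
        const std::size_t m = A.size(), n = B.size();
        const std::size_t INF = m + n; // exceeds any possible distance
        // (m+2) x (n+2) table with a sentinel row & column holding INF
        std::vector<std::vector<std::size_t>> H(m + 2, std::vector<std::size_t>(n + 2, INF));
        for (std::size_t i = 0; i <= m; ++i) H[i + 1][1] = i;
        for (std::size_t j = 0; j <= n; ++j) H[1][j + 1] = j;
        std::unordered_map<char, std::size_t> last_row; // last row in which each character of A occurred
        for (std::size_t i = 1; i <= m; ++i) {
            std::size_t last_col = 0; // last column in this row where A[i-1] matched B
            for (std::size_t j = 1; j <= n; ++j) {
                const std::size_t i1 = last_row[B[j - 1]], j1 = last_col;
                const std::size_t cost = (A[i - 1] == B[j - 1]) ? 0 : 1;
                if (0 == cost) last_col = j;
                H[i + 1][j + 1] = std::min({
                    H[i][j] + cost,                              // substitute (or match)
                    H[i + 1][j] + 1,                             // insert
                    H[i][j + 1] + 1,                             // delete
                    H[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1)  // transpose
                });
            }
            last_row[A[i - 1]] = i;
        }
        return H[m + 1][n + 1];
    }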

Ukkonen

Ten years later, Ukkonen substantially improved the performance of this calculation [3]. His paper contained two major advances. First, he proved (again under certain conditions) that in order to compute the D-L distance, one need only evaluate the recurrence relation in a fairly tight band around its main diagonal, substantially decreasing the number of operations required. Second, he moved from the primary recurrence relation of Lowrance & Wagner to a dual problem: computing f(k,p), defined as the maximal index i on diagonal k for which the edit distance between A(1..i) & B(1..i+k) is p. This doesn’t reduce the time complexity, but it does reduce the space required. His algorithm runs in time O(s*min(|A|,|B|)) (where s is the D-L distance between A & B) and space O(min(s,|A|,|B|)).
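
Again, a sketch may make this concrete. The following restricts itself to plain Levenshtein distance so that the f(k,p) recurrence stays visible (full Damerau-Levenshtein adds a transposition term, which I omit here); all names are illustrative:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Sketch of Ukkonen's f(k,p) formulation for plain Levenshtein distance.
    std::size_t dl_ukkonen(const std::string& A, const std::string& B) {
        if (A.size() > B.size()) return dl_ukkonen(B, A); // ensure |A| <= |B|
        const long m = (long)A.size(), n = (long)B.size();
        const long NEG_INF = -(m + n + 1); // stands in for "minus infinity"
        // f(k,p) for diagonals k in [-m-1, n+1] (one diagonal of padding on
        // each side) and p in [-1, m+n]; stored densely for clarity.
        std::vector<std::vector<long>> fkp(m + n + 3, std::vector<long>(m + n + 2, NEG_INF));
        auto F = [&](long k, long p) -> long& { return fkp[k + m + 1][p + 1]; };
        // boundary values: f(k, |k|-1) is |k|-1 for k < 0 and -1 otherwise
        for (long k = -m; k <= n; ++k) {
            const long a = k < 0 ? -k : k;
            F(k, a - 1) = k < 0 ? a - 1 : -1;
        }
        for (long p = 0; p <= m + n; ++p) {
            // only diagonals with |k| <= p can hold an entry of value p
            for (long k = std::max(-p, -m); k <= std::min(p, n); ++k) {
                long t = std::max({ F(k, p - 1) + 1,        // substitution
                                    F(k - 1, p - 1),        // insertion
                                    F(k + 1, p - 1) + 1 }); // deletion
                t = std::min(t, std::min(m, n - k));
                // slide down the diagonal past matching characters
                while (t < std::min(m, n - k) && A[t] == B[t + k]) ++t;
                F(k, p) = t;
            }
            if (F(n - m, p) == m) return (std::size_t)p; // reached cell (m,n)
        }
        return (std::size_t)n; // unreachable: the distance is at most |B|
    }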

Berghel & Roach

Eleven years on, Berghel & Roach improved Ukkonen’s algorithm by deriving even tighter bounds on the region of the recurrence relation that needs to be computed, leading to (in their application) a 42% speedup over Ukkonen. If s is the edit distance from A to B, and m & n their respective string lengths with (WLOG) n >= m, define p to be (s - (n - m))/2; intuitively, every excursion outside the band of diagonals between 0 and n - m costs two extra edits, so s = (n - m) + 2p. Then the worst-case running time of this algorithm is O(n*p) and the expected running time is O(n + p*s).
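
A trivial helper makes the definition of p concrete (the name band_halfwidth is mine, not the paper's or this package's):

    #include <cstddef>

    // p as defined above; requires n >= m and s >= n - m
    constexpr std::size_t band_halfwidth(std::size_t s, std::size_t n_minus_m) {
        return (s - n_minus_m) / 2;
    }

    // e.g., for Berghel & Roach's "AVERY" vs. "GARVEY" example (s = 3,
    // n - m = 1), only one diagonal beyond the band need be examined
    static_assert(band_halfwidth(3, 1) == 1, "AVERY/GARVEY example");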

After reading this paper, I spent some time searching the literature for references to it, looking for further improvements, but didn’t find any that were directly related. Later references either cited it as the best algorithm known to the author(s) for computing Damerau-Levenshtein distance, or treated it as a point of interest in developing different string metrics.

In my benchmarking, I’ve found it important to avoid the use of std::vector, which surprises me. I’ve instead just used flat arrays & done the two-dimensional indexing manually.
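
Concretely, the replacement looks something like this (a sketch; the struct & member names are mine):

    #include <cstddef>
    #include <memory>

    // Minimal sketch of the flat-array approach; illustrative only.
    struct flat_table {
        std::size_t cols;
        std::unique_ptr<long[]> data; // one flat allocation, row-major
        flat_table(std::size_t rows, std::size_t cols_)
            : cols(cols_), data(new long[rows * cols_]) {}
        // two-dimensional indexing done by hand, in the same style as
        // the fkp[(k + zero_k)*max_p + p] expressions in br.cc
        long& at(std::size_t row, std::size_t col) {
            return data[row * cols + col];
        }
    };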

All the other references check for a common prefix before starting the algorithm proper, an optimization which I haven’t tested, yet.
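
For reference, such prefix-stripping looks something like the following sketch (using std::string_view to avoid copies; the usual argument is that an optimal edit sequence never needs to touch a shared prefix):

    #include <algorithm>
    #include <cstddef>
    #include <string_view>

    // strip a shared prefix from both views in place; illustrative only
    inline void strip_common_prefix(std::string_view& a, std::string_view& b) {
        const std::size_t lim = std::min(a.size(), b.size());
        std::size_t i = 0;
        while (i < lim && a[i] == b[i]) ++i;
        a.remove_prefix(i);
        b.remove_prefix(i);
    }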

References

  1. Fred J. Damerau, A Technique for Computer Detection and Correction of Spelling Errors, Communications of the ACM, 7 (1964), No. 3, 171-176.
  2. Roy Lowrance and Robert A. Wagner, An Extension of the String-to-String Correction Problem, Journal of the Association for Computing Machinery, 22 (1975), No. 2, 177-183.
  3. Esko Ukkonen, Algorithms for Approximate String Matching, Information and Control, 64 (1985), 100-118.
  4. Hal Berghel and David Roach, An Extension of Ukkonen’s Enhanced Dynamic Programming ASM Algorithm, ACM Transactions on Information Systems, 14 (1996), No. 1, 94-106.

damerau-levenshtein's People

Contributors

sp1ff

Forkers

noisehustler

damerau-levenshtein's Issues

Issue#2 still exists: attached is a proposed solution with some advantages

First of all, thank you very much for showing your implementation of the Berghel-Roach algorithm here! As someone who doesn't deal with the topic professionally, I found the explanations in the authors' article so cryptic that I wouldn't have dared to write the code from scratch myself. Your example helped me a lot!

However, the explanations in the authors' article are not only cryptic but also incomplete, as issue #2 shows. I have since noticed that even with your proposed solution to that problem, the function f() continues to erroneously overwrite parts of the fixed range of fkp[].

Example of the incorrect behavior: the calculation of dl("bcac", "cc") changes fkp[13] from -4 to -3.

I have found a new solution, which seems better to me for the following reasons:

  1. The overwriting of fixed ranges of fkp[] is completely eliminated.
  2. The code becomes more efficient, since many useless calls to f() are omitted and some case distinctions in f() are no longer needed.

My suggestions for changes to the br.cc file in detail:

  1. Set p in line 88 as follows: p = abs(k);.
  2. Delete the following lines: 36, 37, 39, 50, 51, 53-55, 57, 63, and 65.
  3. Line 38 becomes: ptrdiff_t t = fkp[(k + zero_k)*max_p+p] + 1;.
  4. Line 41 can be expressed a little more clearly as follows: if (t > 0 && t < m && k + t > 0 && k + t < n) {.
  5. Line 52 becomes: ptrdiff_t ta = fkp[(k - 1 + zero_k)*max_p+p];.
  6. Line 56 becomes: ptrdiff_t tb = fkp[(k + 1 + zero_k)*max_p+p] + 1;.
  7. Simplify line 62 as follows: while (t < L && A[t] == B[t+k]) ++t;

Verification: After these changes have been made to the code, fkp[] contains the patterns that Berghel & Roach present in Figures 3 and 5 as examples for a worst-case and an "average" scenario, i.e. dl("ABCDE", "FGHIJ") and dl("AVERY", "GARVEY").
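
These examples can also be checked mechanically; here is a sketch (dl() stands for whichever implementation is under test, and the expected values, 5 and 3, are my reading of the distances in those figures):

    #include <cassert>
    #include <cstddef>
    #include <string>

    // hypothetical signature; stands in for the implementation under test
    std::size_t dl(const std::string& a, const std::string& b);

    int main() {
        assert(dl("ABCDE", "FGHIJ") == 5);  // worst case (Fig. 3): five substitutions
        assert(dl("AVERY", "GARVEY") == 3); // "average" case (Fig. 5)
        return 0;
    }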

What do you think?

PS: There is still one difference between the fkp[] in the article and the one produced by your code: in the unused, variable areas the value -inf appears instead of 0. For a more universal, more robust variant of the code this should be changed, along with two other small details; see issue #4.

Proposal for better robustness of the Berghel-Roach algorithm

The code has small weaknesses in certain special cases, namely when one or both of the compared strings have length 0. Here is my suggestion for solving the problem:

  1. The volatile, overwritable area of the array fkp[] must be initialized to 0 instead of -inf so that the code handles the special cases mentioned correctly. To do this, the following changes must be made (file br.hh):

    • Delete lines 137-141.

    • The complete initialization then takes place in the two nested loops at lines 142-154, which must be extended as follows:

    for (ptrdiff_t k = - zero_k; k <= zero_k; ++k) {
        ptrdiff_t abs_k = k;
        if (k < 0) abs_k = -k;
        for (ptrdiff_t p = -1; p <= (ptrdiff_t)inf; ++p) {
            if (p == abs_k - 1) {
                if (k < 0) {
                    fkp[(k + zero_k) * max_p + p + 1] = abs_k - 1;
                } else {
                    fkp[(k + zero_k) * max_p + p + 1] = -1;
                }
            } else if (p < abs_k) {
                fkp[(k + zero_k) * max_p + p + 1] = -inf;
            } else {
                fkp[(k + zero_k) * max_p + p + 1] = 0;
            }
        }
    }
    
  2. To prevent the memory access errors that occur when reading outside the memory area reserved for fkp[] in the cases mentioned, two calls within f() (file br.cc) must be further restricted.
    Lines 50-53 become:

     ptrdiff_t ta = -inf;
     if (p >= 0 && k > -zero_k) { // 'p >= 0 &&' can be omitted if the changes from issue#3 have been applied
        ta = fkp[(k - 1 + zero_k) * max_p + p];
     }
    

    Lines 54-57 become:

    ptrdiff_t tb = -inf;
    if (p >= 0 && k < zero_k) { // 'p >= 0 &&' can be omitted if the changes from issue#3 have been applied
        tb = fkp[(k + 1 + zero_k) * max_p + p] + 1;
    }
    

Best regards,
Olaf

Function `f` in Berghel & Roach will index into the strings at illegal locations

The line:

    while (t < (ptrdiff_t)std::min(m, n - k) && A[t] == B[t+k]) ++t;

(br.cc, line 62) will index into A & B at illegal locations (both negative & past-the-end). This can be seen by adding, ahead of that line, a simple assertion that the index arguments are legal:

    assert(t >= (ptrdiff_t)std::min(m, n - k) || (t >= 0 && t <= m && t+k >= 0 && t+k < n));

With this addition, all unit tests that use the algorithm of Berghel & Roach will fail.

Generally, this won't affect the algorithm, since calling operator[] with bad indices will return garbage & the second test will fail anyway. However, "If pos > size(), the behavior is undefined." per cppreference, and when I recently built this project on a new platform, an assertion in basic_string::operator[] fired.
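
A bounds-checked variant of that scan, mirroring the assertion's conditions, might look like the following sketch (not necessarily the right fix; L here abbreviates std::min(m, n - k), as in the earlier proposal):

    // guard both indices before dereferencing; a sketch, not the adopted fix
    const ptrdiff_t L = (ptrdiff_t)std::min(m, n - k);
    while (t >= 0 && t + k >= 0 && t < L && A[t] == B[t + k])
        ++t;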
