Coder Social home page Coder Social logo

sp1ff / damerau-levenshtein Goto Github PK

View Code? Open in Web Editor NEW
5.0 1.0 1.0 158 KB

Comparison of a few algorithms for computing Damerau–Levenshtein distance

License: GNU General Public License v3.0

Makefile 2.70% Shell 60.59% M4 1.31% C++ 35.40%
damerau-levenshtein damerau-levenshtein-distance cpp string-matching string-distance

damerau-levenshtein's Introduction

Comparing Damerau-Levenshtein Implementations

Introduction

This repository contains three implementations of Damerau-Levenshtein string distance computation in C++ 17.

The Program

This package builds a program imaginatively named dl. It is installed via the usual autotools incantations. dl will compute Damerau-Levenshtein distance over a corpus of input strings provided on the command line with one of three algorithms (on which more below). For performance benchmarking, you can specify that the program should run over the corpus repeatedly and/or print timing information on completion. Say dl --help for full usage.

Discussion

The Damerau-Levenshtein distance between two strings A & B is the minimal number of insertions, deletions, single-character changes & transpositions needed to transform A into B (e.g. “act” -> “cat” -> “cart”, so the D-L distance between “act” & “cart” is two). In his original paper [1] Damerau claimed that 80% of the errors in the system which gave rise to his work could be accounted for by one of these four errors.

While it’s a popular algorithm for fuzzy searching, there doesn’t seem to be a “go-to” implementations out there in C/C++ (it is a feature request for the boost string library). There are several others on Github, but the ones I examined lacked both documentation on the algorithms used as well as test suites.

This repo contains implementations of what I understand to be the three major algorithms implementing Damerau-Levenshtein distance in C++.

Lowrance & Wagner’s Recursive Relation

This is the algorithm laid out in Wikipedia (at the time of this writing). It is due to Lowrance & Wagner [2] and seems to be the most widely known. In their paper, they prove that under a few (reasonable) conditions, a simple |A| x |B| recurrence relation is sufficient to compute the D-L distance between strings A & B. Their algorithm runs in O(|A|*|B|) time & space.

Ukkonen

Ten years later, Ukkonen substantially improved the performance of this calculation [3]. His paper contained two major advancements. He proved (again under conditions) that in order to compute the D-L distance, one need only compute the recurrence relation in a fairly tight band around its main diagonal (substantially decreasing the number of operations required). Next, he moved from the primary recurrence relation of Lowrance & Wagner to a dual problem of computing f(k,p) which is defined as the maximal index i on diagonal k for which the edit distance between A(1..i) & B(1..j) is p; this doesn’t reduce the time complexity but does reduce that of space. His algorithm runs in O(s*min(|A|,|B|)) (where s is the D-L distance between A & B) and space O(min(s,|A|,|B|))).

Berghel & Roach

Eleven years on, Berghel & Roach improved Ukkonen’s algorithm by deriving even tighter bounds on the region of the region of the recurrence relation that needs to be computed, leading to (in their application) a 42% speedup over Ukkonen. If s is the edit distance from A to B, and m & n their respective string lengths, and WLOG n >= m, define p to be 1/2s - 1/(2*(n-m)). Then the worst-case running time of this algorithm is O(n*p) and the expected running time is O(n+(p*s)).

After reading this paper, I spent some time searching the literature for references to it (looking for further improvements). I didn’t find any directly-related improvements. Later references either cited it as the best known to the author(s) for computing Damerau-Levenshtein distance, or as a point of interest in developing different string metrics.

In my benchmarking, I’ve found it important to avoid the use of std::vector, which surprises me. I’ve instead just used flat arrays & done the two-dimensional indexing manually.

All the other references check for common prefixes before starting the algorithm, which I haven’t tested, yet.

References

  1. Fred J. Damerau, A Technique for Computer Detection and Correction of Spelling Errors, Communications of the ACM, 7 (1964) No. 3, 171-176.
  2. Roy Lowrance and Robert A. Wagner, An Extension of the String-to-String Correction Problem, Journal of the Association for Computing Machinery, 22 (1975), No 2, 177-183.
  3. Esko Ukkonen, Algorithms for Approximate String Matching, Information and Control, 64 (1985) 100–118.
  4. Hal Berghel and David Roach, An Extension of Ukkonen’s Enhanced Dynamic Programming ASM Algorithm, ACM Transactions on Information Systems, 14 (1996) No. 1, 94-106.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.