Coder Social home page Coder Social logo

schiffma / distlib Goto Github PK

View Code? Open in Web Editor NEW
21.0 1.0 1.0 1.77 MB

Distance related functions (Damerau-Levenshtein, Jaro-Winkler , longest common substring & subsequence) implemented as SQLite run-time loadable extension. Any UTF-8 strings are supported.

License: MIT License

C++ 7.75% C 92.09% Batchfile 0.13% Shell 0.02%
sqlite3 sqlite extension levenshtein-distance damerau-levenshtein-distance longest-common-subsequence longest-common-substring jaro-winkler jaro-winkler-distance permutation

distlib's Introduction

distlib

Alt text

Distance related functions implemented as SQLite run-time loadable extension. Any UTF-8 strings are supported.

Functions provided:

SQLite function Description
ldist(x,y) Levenshtein distance as integer >= 0
lsim(x,y) Levenshtein similarity as double between 0.0 and 1.0
dldist(x,y) Damerau–Levenshtein distance as integer >= 0
dlsim(x,y) Damerau–Levenshtein similarity as double between 0.0 and 1.0
jsim(x,y) Jaro similarity as double between 0.0 and 1.0
jwsim(x,y) Jaro-Winkler similarity as double between 0.0 and 1.0
lcstr(x,y) Longest common substring
lcstrl(x,y) Longest common substring length
lcseq(x,y) Longest common subsequence
lcseql(x,y) Longest common subsequence length
perm(x) Permutation table-valued function for strings up to len(x)>=10
subseq(x) Subsequences table-valued function for strings up to len(x)>=16

Examples:

Query example Result example
select ldist('kitten','sitting'); 3
select lsim('kitten','sitting'); 0.571428571428571
select ldist('a cat','an act'); 3
select dldist('a cat','an act'); 2
select lsim('a cat','an act'); 0.5
select lsim('a cät','an äct'); 0.5
select dlsim('a cat','an act'); 0.666666666666667
select ldist('fish', 'ifsh'); 2
select dldist('fish', 'ifsh'); 1
select ldist('kitten','sitting'); 3
select jsim('Clark Kent','Claire Kent'); 0.869023569023569
select jwsim('Peter Parker','Pete Parker'); 0.928787878787879
select jwsim('Péter Pärker','Péte Pärker'); 0.928787878787879
select lcstr('carpenter', 'sharpener'); arpen
select lcstrl('carpenter', 'sharpener'); 5
select lcseq('cärpenter', 'shärpener'); ärpener
select lcseql('cärpenter', 'shärpener'); 7
select * from perm('ABCDEFGH'); ABCDEFGH,ABCDEFHG, ... n! rows
select * from perm ('Smørrebrød'); Smørrebrød,Smørrebødr, ... n! rows
select * from subseq('ABCDEFGH'); A,AB,ABC, ... max. 2^n-1 rows

Bulk examples:

WITH orig_ AS (SELECT 'Geräte' AS orig),
perm_ AS (SELECT orig, permut FROM perm(orig) JOIN orig_ ON 1=1)
SELECT * FROM (
SELECT orig, permut, lcstr(orig, permut) lcstr, lcstrl(orig, permut) lcstrl,
lcseq(orig, permut) lcseq, lcseql(orig, permut) lcseql FROM perm_
) ORDER BY lcstrl desc, lcseql desc;

DROP TABLE IF EXISTS perm_gen;
CREATE TABLE IF NOT EXISTS perm_gen AS
WITH orig_ AS (SELECT 'Z''Hansrüed' AS orig),
perm_ AS (SELECT orig, permut FROM perm(orig) JOIN orig_ ON 1=1)
SELECT * FROM perm_;

DROP TABLE IF EXISTS perm_test;
CREATE TABLE IF NOT EXISTS perm_test AS
SELECT orig, permut, lcstr(orig, permut) lcstr, lcstrl(orig, permut) lcstrl, lsim(orig, permut) lsim,
dlsim(orig, permut) dlsim, jsim(orig, permut) jsim, jwsim(orig, permut) jwsim FROM perm_gen;

DROP TABLE IF EXISTS subseq_gen;
CREATE TABLE IF NOT EXISTS subseq_gen AS
WITH orig_ AS (SELECT 'Lückenbüßer' AS orig),
subseq_ AS (SELECT orig, subseq FROM subseq(orig) JOIN orig_ ON 1=1)
SELECT * FROM subseq_;

DROP TABLE IF EXISTS subseq_test;
CREATE TABLE IF NOT EXISTS subseq_test AS
SELECT orig, subseq, lcstr(orig, subseq) lcstr, lcstrl(orig, subseq) lcstrl, lsim(orig, subseq) lsim,
dlsim(orig, subseq) dlsim, jsim(orig, subseq) jsim, jwsim(orig, subseq) jwsim from subseq_gen;

How to build the .dll/.so library?

The library was primary built on Windows 10 64-bit with Visual Studio Community 2019 and tested with DB Browser for SQLite.
Smoke testing was also done on Linux Mint 19.3 "Tricia" and the latest Raspberry Pi OS on a Raspberry Pi 400.

Compile on Windows by reviewing & calling win64build_extensions_generic.bat :

cl /EHsc /FojaroWinkler.obj /c jaroWinkler.cpp
cl /EHsc /Fopylcs.obj /c pylcs.cpp
cl /EHsc /Fdldist.obj /c dldist.cpp
cl /EHsc /Flcsubstr.obj /c lcsubstr.cpp
cl /EHsc /Foperm.obj /c perm.cpp
cl /EHsc /Fosubseq.obj /c subseq.cpp
cl /EHsc /FoRegistExt.obj /c RegistExt.cpp
cl /EHsc /Foutf8_unicode.obj /c utf8_unicode.cpp
link /DLL /OUT:distlib_64.dll utf8_unicode.obj RegistExt.obj perm.obj subseq.obj jaroWinkler.obj pylcs.obj dldist.obj lcsubstr.obj

Compile on Linux/Raspberry Pi OS in your default shell :

chmod 755 linux_build_extensions.sh
./linux_build_extensions.sh

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.