Coder Social home page Coder Social logo

erl_fuzzy_match's Introduction

Erlang Fuzzy String Matcher

The Fuzzy String Matcher for Erlang pulls together a handful of algorithms to make fuzzy string matching available as a library to your Erlang programs.

This little piece of code was inspired by the Python code presented in http://www.sportshacker.net/posts/fuzzy_string_matching.html

Overview

The fuzzy matcher runs as a node-local, named gen_server that keeps a dictionary of translations. If a string is not found in the lookup dictionary, it is sent through a list of string matching algorithms (abbreviations, levenshtein, tokenized). A valid match is added as a new entry in the dictionary. If no match was found, the string is taken as a new entry in the dictionary, matching itself.

An example, using English Premier league football team names ('14/'15 season):

1> Teams = [
<<"Arsenal">>,
<<"Aston Villa">>,
<<"Burnley">>,
<<"Chelsea">>,
<<"Crystal Palace">>,
<<"Everton">>,
<<"Hull City">>,
<<"Leicester City">>,
<<"Liverpool">>,
<<"Manchester City">>,
<<"Manchester United">>,
<<"Newcastle United">>,
<<"Queens Park Rangers">>,
<<"Southampton">>,
<<"Stoke City">>,
<<"Sunderland">>,
<<"Swansea City">>,
<<"Tottenham Hotspur">>,
<<"West Bromwich Albion">>,
<<"West Ham United">>
].
[<<"Arsenal">>,<<"Aston Villa">>,<<"Burnley">>,
 <<"Chelsea">>,<<"Crystal Palace">>,<<"Everton">>,
 <<"Hull City">>,<<"Leicester City">>,<<"Liverpool">>,
 <<"Manchester City">>,<<"Manchester United">>,
 <<"Newcastle United">>,<<"Queens Park Rangers">>,
 <<"Southampton">>,<<"Stoke City">>,<<"Sunderland">>,
 <<"Swansea City">>,<<"Tottenham Hotspur">>,
 <<"West Bromwich Albion">>,<<"West Ham United">>]
2> OddNames = [
{<<"Saints">>, <<"Southampton">>},
{<<"Spurs">>, <<"Tottenham Hotspur">>}
].
[{<<"Saints">>,<<"Southampton">>},
 {<<"Spurs">>,<<"Tottenham Hotspur">>}]
3> erl_fuzzy_match:start_link(premier, dict:from_list(OddNames), Teams).
{ok,<0.36.0>}
4> erl_fuzzy_match:translate(premier, <<"Chelsea">>).
<<"Chelsea">>
5> erl_fuzzy_match:translate(premier, <<"Manchester Utd">>).
<<"Manchester United">>
6> erl_fuzzy_match:translate(premier, <<"West Ham">>).
<<"West Ham United">>
7> erl_fuzzy_match:translate(premier, <<"Newcastle">>).
<<"Newcastle United">>
8> erl_fuzzy_match:translate(premier, <<"QPR">>).
<<"Queens Park Rangers">>

But some caution may still be required (!):

9> erl_fuzzy_match:translate(premier, <<"W.B.A">>).
<<"W.B.A">>
10> erl_fuzzy_match:translate(premier, <<"Bristol City">>).
<<"Hull City">>

Starting And Stopping

A fuzzy matcher is started by one of the start_link functions:

start_link(Name)
start_link(Name, Dict, Canon)

where Name is an atom, Dict a mapping from UTF-8 binary strings to UTF-8 binaries strings ( dict(binary(),binary()) ), and Canon a list of UTF-8 binary strings.

Multiple fuzzy matchers may run at the same time. The name supplied at startup allows selecting which fuzzy matcher to use when using the other API calls.

A list of known canonical strings may be provided when starting the fuzzy matcher, otherwise it is initially empty. This can be done by either providing an empty dictionary and a list of canonical UTF-8 binary strings, a pre-populated dictionary of known matches (from UTF-8 binary strings to UTF-8 binary strings), or a mixture of the two.

Under normal operations it is usually desirable to enter new, unmatched strings into the list of canonical strings and into the dictionary. If the extension of the initial set of canonical strings and mappings is not desired, an option [fixed] may be provided via the additional start function:

start_link(Name, Dict, Canon, [fixed])

A fuzzy matcher may be stopped either explicitly via the stop function

stop(Name)

A fuzzy matcher may also get started as part of a supervision tree, in which case the normal supervision shutdown sequence will lead to graceful termination of the fuzzy matcher.

String Translation

A subject string may be fuzzily matched and translated into canonical form using

translate(Name, S)
translate(Name, S, Matchers)

where Name is an atom (the given name of the fuzzy matcher when it was started), S is a UTF-8 binary string to translate, and Matchers is an optional list of functions to use as the fuzzy matching algorithms.

A matcher function must obey the type

Matcher = fun(S :: binary(), [Canon :: binary()]) -> undefined | {ok, Name :: binary()}

that is, take the subject string as a UTF-8 binary and a list of canonical UTF-8 binary strings returning either the atom undefined if no match was found or {ok, Canonical} if Canonical was found as a suitable match for the subject.

If a list of matcher functions is provided, they are tried in the order provided in the list, if a matcher function returns 'undefined', the next one on the list is tried.

The default list of match functions are:

  • exact match (via the dictionary)
  • abbreviation match (using this algorithm)
  • Levenshtein difference between subject and canonical string is less than or equal to two (2)
  • all tokens (separated by whitespace) of subject and canonical matched according to exact match or levenshtein

If no matcher function finds a match, the translate/2,3 functions return the original subject string. Additionally, if the start option [fixed] was not provided, the subject string is entered into the list of canonical strings.

Data Access

The list of matcher functions is returned by

matchers(Name)

The current lookup dictionary is returned by

dict(Name)

The current list of canonical strings is returned by

canonicals(Name)

Possible Extensions

  • Configurability of the Levenshtein distance threshold.
  • Upfront configurability of the default matcher sequence.
  • Upfront configurability of externally defined matchers.
  • Lowercase strings before matching.
  • Strip out punctuation characters before matching.

erl_fuzzy_match's People

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.