Coder Social home page Coder Social logo

andy-wagner / spellblaze Goto Github PK

View Code? Open in Web Editor NEW

This project forked from jpsingarayar/spellblaze

0.0 1.0 0.0 2.55 MB

Symmetric Delete spelling correction algorithm using Java

Home Page: http://www.spellblaze.org/

License: GNU General Public License v3.0

Java 96.94% HTML 3.06%

spellblaze's Introduction

SpellBlaze - A million times faster spell correction

Build Status License: GPL v3

A SymSpell implementation for spelling correction & Fuzzy search. It's 1 million times faster through Symmetric Delete spelling correction algorithm

SymSpell

The fundamental beauty of SymSpell is the Symmetric Delete spelling correction algorithm which reduces the complexity of edit candidate generation and dictionary lookup for a given edit distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

Additionally only deletes are required, no transposes + replaces + inserts. Transposes + replaces + inserts of the input phrase are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!

SpellBlaze

SpellBlaze has these features added on top of the Symspell to provide more accuracy,

  1. The Damerau-Levenshtein implemtation forces a edit distance of 1 for all operations. However, sometimes not all edits are created equal. For instance, if you are doing OCR correction, maybe substituting '0' for 'O' should have a smaller cost than substituting 'X' for 'O'. If you are doing human typo correction, maybe substituting 'X' for 'Z' should have a smaller cost, since they are located next to each other on a QWERTY keyboard. So we have implemented the Damerau-Levenshtein implementation with a weighted Damerau-Levenshtein implementation: where each operation (delete, insert, swap, replace) can have different edit weights.

  2. Included a ported version of a blazingly fast phonetic reduction/hashing algorithm Eudex.

  3. Included a prefix proximity algorithm, JaroWinkler calculation

  4. A fragment similarity component optimized version of the dice coefficient calculation to find the similarity between two strings.

  5. Implemeted the dynamic replacement weight based on the QWERTY keyboard distance(since letters close to each other are more likely to be replaced)

  6. Normalization of query before search.

  7. This can be used to generate a correct single match for a string or a list of suggestions.

  8. Works with multiple words, each word would get its own match with all the similarity scoring.

Installation

Running as a Packaged Application

$ git clone https://github.com/jpsingarayar/SpellBlaze.git

$ mvn clean install

$ java -jar target/sc.jar

Using the Maven Plugin

$ git clone https://github.com/jpsingarayar/SpellBlaze.git

$ mvn clean install

$ mvn spring-boot:run

If run sucessful, the service would be up listening on port 8080.

Test the service by hitting the below url.

http://localhost:8080/spellcheck/spellcheck?q=dissapear

This should give the following response.

{
   "origPhrase": "dissapear",
   "id": 2,
   "correctedPhrase": "disappear",
   "correctedWords": [
      "disappear"
   ],
   "similarity": 0.779589683121326,
   "similarityMap": {
      "disappear": 0.779589683121326
   },
   "responseTime": 0,
   "proximity": null,
   "editDistanceMax": 2
}

For a detailed similarity score between the orginal word and the matched word append a param debug=true to the end of the url.

http://localhost:8080/spellcheck/spellcheck?q=dissapear&debug=true

{
   "origPhrase": "dissapear",
   "id": 3,
   "correctedPhrase": "disappear",
   "correctedWords": [
      "disappear"
   ],
   "similarity": 0.779589683121326,
   "similarityMap": {
      "disappear": 0.779589683121326
   },
   "responseTime": 1,
   "proximity": [
      {
         "count": 45,
         "distance": 1.44,
         "wordFrequency": 0.0000178065989672964,
         "editProximity": 0.28,
         "phoneticProximity": 1,
         "fragmentProximity": 0.875,
         "prefixProximity": 0.9481481313705444,
         "proximity": 0.779589683121326,
         "term": "disappear"
      }
   ],
   "editDistanceMax": 2
}

Contributing to this project

There are lot of scope for imporvement and I would gladly welcome contributors to this projects.

  1. Create a feature branch and commit your changes.
  2. Submit a merge request.
  3. Each and every commit that makes a difference to the project would get a perk :).

spellblaze's People

Contributors

jpsingarayar avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.