Coder Social home page Coder Social logo

textmatchingsimilarity's Introduction

Matching two JSON files using Text Similarity

Challenge

The challenge is to determine when two pieces of information from different data sources are actually talking about the same.

For example,you have to compare two similiar products "Sony_Cyber-shot_DSC-W310" and "Sony DSC-W310 12.1MP Digital Camera ...", and using text similiraty algorithms match them.

However there are some requirements:

  • Precision – do you make (m)any false matches?
  • Recall – how many correct matches did you make?
  • Efficiency (data structure and algorithm choices)

First approach

After a while analysing carefully every piece of the problem, I made some reasearches about the academic names quoted in the challenge description.

Record Linkage and Entity Resolution:

Comparing some of them, I decided to pick up commonly used algorithms and some of their variations. Based on the idea to match string I found rich materials about the pros and cons of each algorithm. Finally, I chose these:

The Wagner–Fischer algorithm is a better version of Levenshtein Distance algorithm (also called as Edit Distance) using dynamic programing, it calculates the edit distance between two strings. The least distance should be the best match.

After looking deeply into some articles about the both algorithms, the both were strongly recommended to match strings. Beyond that if you made some modifications you should improve the original version to a better one and reduce the time complexity. For this, you just have to use two matrix rows and calculate only the diagonal values as described in the article bellow.

When I finished the implementation and runned the code, it worked well however the results and the running time were really bad. So, I figured out the idea to compare the distance from each strings in thousands of lines didn't seem to be the best aprroach.

These algorithms work well for estimating distance between shorter strings that differ largely at character level, they become too computationally expensive and less accurate for larger strings and that was problem, long strings.

Then, I reviewed the problem and the requirements. During this process I though why not match the words instead of letters. Looking to the files you can see strong characteristics in some attributes (manufacturer, model) which should define a product properly. Then, I started to look a best approach.

Best approach

Again, I started to look at new algorithms to find a more appropriate solution and while I was revising some previous article about String Metrics I found Jaccard Similarity Coefficient.

Jaccard Similarity can be used as the simplest method for computing likeness as the proportion of tokens shared by both strings. In summary, it is a token based vector used for comparing similarity and differences between sets of data, in that case set of words(strings).

Conclusion

Finally, after changing some codes, I could get better results and reduced the running time in 75% comparing with the first version.

textmatchingsimilarity's People

Contributors

fpauer avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.