
chiragjn / short-text-similarity


Short Text Similarity as described in https://dl.acm.org/citation.cfm?id=2806475

License: MIT License

Python 100.00%
short-text-semantic-similarity sts semantic-similarity word-vectors word-embeddings text-similarity

short-text-similarity's Introduction

Short Text Similarity with word embedding vectors


A quick implementation of the STS model described in Tom Kenter & Maarten de Rijke, "Short Text Similarity with Word Embeddings".

Caution: A few assumptions were made where the paper was unclear.

The KenterSTS class in sts.py has a short Python docstring that refers to the hyperparameters in the paper.

See the main part of sts.py for sample usage:

import os

from gensim.models import FastText
from sts import *

# Sentence pairs and one label per pair
sample_data = [(u'hello', u'hi'), (u'i like this', u'i hate it')]
sample_labels = [1, 0]
# Per-word weights; sample_unk_weight is passed separately for unknown words
sample_weights = {u'hello': 1, u'hi': 1, u'i': 0.1, u'like': 1, u'this': 0.5, u'hate': 0.9, u'it': 0.5}
sample_unk_weight = 0.5

# Tokenize every sentence to train a toy word vector model
_docs = []
for a, b in sample_data:
    _docs.append(a.split())
    _docs.append(b.split())
sample_vectorizer = GensimWordVectorizer(FastText(_docs, min_count=1))  # plug any gensim word vector model here

model = KenterSTS(sample_weights, sample_unk_weight, vectorizer=sample_vectorizer)
model.fit(sample_data, sample_labels)

# Save, reload, and set the vectorizer again on the loaded model
model.save('test_save')
model = KenterSTS.load('test_save')
model.set_vectorizer(sample_vectorizer)
print("Test Prediction:", model.predict([(u'hello', u'hi')]))

# Remove the files created by the save above
os.remove('test_save')
os.remove('test_save.sklearn')

Additional Caution


This module works with dense vectors only!

With the default parameters the model generates about 15 features per pair, so this shouldn't be a problem for moderately large datasets.

If you increase the number of bins and have a huge dataset, consider modifying the code to work with sparse matrices.
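To make the link between bins and feature count concrete, here is a minimal sketch of histogram-style features over word-to-word similarities. It is not the code in sts.py: the function names (`cosine`, `binned_max_similarities`), the toy 2-d vectors, and the bin edges are all assumptions made up for illustration. It only shows that each extra bin edge adds one feature per pair, and that for short texts most of those counts stay at zero, which is where a sparse representation would start to pay off.

```python
import math
from bisect import bisect_right


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def binned_max_similarities(words_a, words_b, vectors, bin_edges):
    """Count, per bin, how well each word in words_a matches its best partner in words_b.

    bin_edges are interior boundaries, so the result has len(bin_edges) + 1 counts.
    """
    counts = [0] * (len(bin_edges) + 1)
    for w in words_a:
        best = max(cosine(vectors[w], vectors[x]) for x in words_b)
        counts[bisect_right(bin_edges, best)] += 1
    return counts


# Hypothetical toy vectors and bin edges, not values taken from the repo or the paper.
toy_vectors = {
    "hello": [1.0, 0.0],
    "hi": [1.0, 0.0],
    "there": [0.0, 1.0],
}
edges = [0.15, 0.4, 0.8]

features = binned_max_similarities(["hello", "there"], ["hi"], toy_vectors, edges)
print(features)  # one count per bin for this pair
```

With three edges the pair gets four counts; doubling the number of edges roughly doubles the width of the feature matrix while the number of non-zero entries per row stays bounded by the number of words.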
