cmyr / anagramatron
twitter anagram hunter
I want this to be more fully isolated from the rest of the project, so that I can swap in another implementation in the future.
This would basically notify me any time there's an unhandled exception.
This seems saner than gdbm, maybe.
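One way to get that kind of notification in Python is to replace `sys.excepthook`. This is only a minimal sketch: `make_excepthook`, `install`, and the `notify` callback are hypothetical names, and how the message actually gets delivered (email, push, DM) is left open. Note that `sys.excepthook` only covers exceptions that reach the top of the main thread.

```python
import sys
import traceback


def make_excepthook(notify):
    """Build an excepthook that passes the formatted traceback to `notify`.

    `notify` is a hypothetical callable taking one string argument.
    """
    def hook(exc_type, exc_value, exc_tb):
        # Let Ctrl-C behave normally instead of triggering a notification.
        if issubclass(exc_type, KeyboardInterrupt):
            sys.__excepthook__(exc_type, exc_value, exc_tb)
            return
        report = "".join(
            traceback.format_exception(exc_type, exc_value, exc_tb)
        )
        try:
            notify(report)
        finally:
            # Still print the traceback to stderr as usual.
            sys.__excepthook__(exc_type, exc_value, exc_tb)
    return hook


def install(notify):
    """Route every uncaught main-thread exception through `notify`."""
    sys.excepthook = make_excepthook(notify)
```

Calling `install(some_function)` once at startup would be enough; `some_function` receives the full traceback text.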
https://plyvel.readthedocs.org/en/latest/
https://github.com/google/leveldb
Saw this project on reddit and thought it was really awesome. Keep up the good work, it's a great idea. Anyway, I saw that you said you were manually rejecting "false" hits, which included identical tweets or transpositions of a few letters, and I thought to myself that could be an easy fix.
I wrote a simple, abstracted implementation (in JavaScript) to remove some of the "false" hits, mainly because I was too lazy to look into how to implement it in Python or how to integrate it with your project and submit a PR, haha.
Hopefully it helps. Maybe at least it keeps you from having to think of a solution haha. Awesome project!
Carlos
//tweets would have to be stripped of the things you exclude in your readme, like @'s
//words to ignore (pronouns, articles, etc.); placeholder entries, fill in as needed
var IGNORED_WORDS_HASH = { 'a': true, 'an': true, 'the': true, 'i': true, 'you': true };

function tooSimilar(tweet1, tweet2) {
    var LD_THRESHOLD = 4, //tweets fewer than this many edits apart are too similar (insertions, deletions, substitutions, transpositions)
        TWEET_WORD_SIMILARITY = 3, //max number of words in common
        matchCount = 0;
    tweet1.split(' ').forEach(function(word) {
        //ignore words like pronouns, articles, etc.
        //note: indexOf is a substring match, so "cat" also matches inside "catalog"
        if (IGNORED_WORDS_HASH[word] === undefined && tweet2.indexOf(word) > -1) {
            matchCount++;
        }
    });
    //lev_Dist can be expensive so we want to kick out early if possible
    if (matchCount > TWEET_WORD_SIMILARITY) {
        return true;
    }
    var lev_Dist = levenshtein(tweet1, tweet2);
    return (lev_Dist < LD_THRESHOLD);
}
//credit: http://stackoverflow.com/questions/11919065/sort-an-array-by-the-levenshtein-distance-with-best-performance-in-javascript
//info: http://en.wikipedia.org/wiki/Levenshtein_distance
function levenshtein(s, t) {
    var d = []; //2d matrix
    // Step 1: handle empty strings
    var n = s.length;
    var m = t.length;
    if (n == 0) return m;
    if (m == 0) return n;
    //Create an array of arrays in javascript (a descending loop is quicker)
    for (var i = n; i >= 0; i--) d[i] = [];
    // Step 2: fill the first column and the first row
    for (var i = n; i >= 0; i--) d[i][0] = i;
    for (var j = m; j >= 0; j--) d[0][j] = j;
    // Step 3
    for (var i = 1; i <= n; i++) {
        var s_i = s.charAt(i - 1);
        // Step 4
        for (var j = 1; j <= m; j++) {
            //The original early exit, `if (i == j && d[i][j] > 4) return n;`, read d[i][j]
            //before it was assigned, so it never fired; it is dropped here.
            var t_j = t.charAt(j - 1);
            var cost = (s_i == t_j) ? 0 : 1; // Step 5
            //Calculate the minimum of deletion, insertion and substitution
            var mi = d[i - 1][j] + 1;
            var b = d[i][j - 1] + 1;
            var c = d[i - 1][j - 1] + cost;
            if (b < mi) mi = b;
            if (c < mi) mi = c;
            d[i][j] = mi; // Step 6
            //Damerau transposition (checks for transposition of letters, e.g. haet and hate have a
            //Damerau-Lev distance of 1 instead of a Lev distance of 2). Can be removed for optimization
            if (i > 1 && j > 1 && s_i == t.charAt(j - 2) && s.charAt(i - 2) == t_j) {
                d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost);
            }
        }
    }
    // Step 7
    return d[n][m];
}
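Since the project itself is in Python, the same filter could be ported roughly as follows. Treat this as a sketch rather than a drop-in: the function names and the `IGNORED_WORDS` list are hypothetical, the thresholds are copied from the JavaScript above, and the word check here compares whole words instead of the substring match that `indexOf` performs.

```python
# Hypothetical stop-word list; the real one would need to be filled in.
IGNORED_WORDS = {"a", "an", "the", "i", "you", "he", "she", "it", "we", "they"}
LD_THRESHOLD = 4            # fewer edits than this means "too similar"
TWEET_WORD_SIMILARITY = 3   # max number of words in common


def damerau_levenshtein(s, t):
    """Edit distance that also counts an adjacent transposition as one edit."""
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    # d[i][j] is the distance between the first i chars of s and first j of t.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "haet" -> "hate".
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[n][m]


def too_similar(tweet1, tweet2):
    """Return True if the two tweets look like near-duplicates."""
    words2 = set(tweet2.split())
    match_count = sum(
        1 for word in tweet1.split()
        if word not in IGNORED_WORDS and word in words2
    )
    # The distance can be expensive, so bail out early when possible.
    if match_count > TWEET_WORD_SIMILARITY:
        return True
    return damerau_levenshtein(tweet1, tweet2) < LD_THRESHOLD
```

As with the JavaScript version, both tweets would need to be stripped of @-mentions and the like before being compared.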
Because no keys are in memory at launch, hit processing is really slow and we always end up using all of our buffer.
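The thread does not say how this was addressed. One possible approach, sketched below, is to load the key set into memory once at startup so that lookups do not hit the store. `KeyCache` and its methods are hypothetical names, `store` stands in for whatever mapping-like database object the project uses, and the sketch assumes the full key set fits in memory.

```python
class KeyCache:
    """Keeps the set of known keys in memory in front of a slower store."""

    def __init__(self, store):
        self._store = store   # any mapping-like object (hypothetical stand-in)
        self._keys = set()

    def warm(self):
        """Load every stored key into memory once, at launch."""
        self._keys.update(self._store.keys())
        return len(self._keys)

    def __contains__(self, key):
        # Membership checks never touch the underlying store.
        return key in self._keys

    def add(self, key, value):
        """Write through to the store and remember the key."""
        self._store[key] = value
        self._keys.add(key)
```

With this, `key in cache` stays an in-memory set lookup once `warm()` has run.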