Coder Social home page Coder Social logo

winnow-ruby's Introduction

Winnow

A tiny Ruby library for document fingerprinting.

What is document fingerprinting?

Document fingerprinting converts a document (e.g. a book, a piece of code, or any other string) into a much smaller number of hashes called fingerprints. If two documents share any fingerprints, then this means there is an exact substring match between the two documents.

Document fingerprinting has many applications, but the most obvious one is for plagiarism detection. By taking fingerprints of two documents, you can detect if parts of one document were copied from another.

This library implements a fingerprinting algorithm called winnowing, described by Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken in a paper titled Winnowing: Local Algorithms for Document Fingerprinting.

Winnowing is perfect for when you want to precalculate fingerprints about some string. However, due to an issue with how Ruby's String#hash works, you might run into issues. Please see the section on caveats for more info and a workaround.

Installation

Add this to your Gemfile:

gem 'winnow'

Or install it directly:

gem install winnow

Usage

The Fingerprinter class takes care of fingerprinting documents. To create a fingerprint, you need to provide two parameters, called the noise threshold and the guarantee threshold. When comparing two documents' fingerprints, no match shorter than the noise threshold will be detected, but any match at least as long as the guarantee threshold is guaranteed to be found.

The proper values for your noise and guarantee thresholds varies by context. Experiment with the data you're looking at until you're happy with the results.

Creating a fingerprinter is easy:

fingerprinter = Winnow::Fingerprinter.new(noise_threshold: 10, guarantee_threshold: 18)

Then, use #fingerprints get the fingerprints. Optionally, pass :source (default is nil) so that Winnow can later report which document a match is from.

document = File.new('hamlet.txt')
fingerprints = fingerprinter.fingerprints(document.read, source: document)

#fingerprints just returns a plain-old Ruby Hash. The keys of the hash are generated from substrings of the document being fingerprinted. Finding shared substrings between documents is as simple as seeing if they share any of the keys in their #fingerprints hash.

To keep things easier for you, Winnow comes with a Matcher class that will find matches between two documents.

Here's an example that puts everything together:

require 'winnow'

str_a = <<-EOF
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
This is copied.
All mimsy were the borogoves,
And the mome raths outgrabe.
EOF

str_b = <<-EOF
"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious -- This is copied. -- Bandersnatch!"
EOF

fprinter = Winnow::Fingerprinter.new(
  guarantee_threshold: 13, noise_threshold: 9)

f1 = fprinter.fingerprints(str_a, source: "Stanza 1")
f2 = fprinter.fingerprints(str_b, source: "Stanza 2")

matches = Winnow::Matcher.find_matches(f1, f2)

# Because 'This is copied' is longer than the guarantee threshold, there might
# be a couple of matches found here. For the sake of brevity, let's only look at
# the first match found.
match = matches.first

# It's possible for the same key to appear in a document multiple times (e.g. if
# 'This is copied' appears more than once). Winnow::Matcher will return all
# matches from the same key in array.
#
# In this case, we know there's only one match (because 'This is copied' appears
# only once in each document), so let's only look at the first one.
match_a = match.matches_from_a.first
match_b = match.matches_from_b.first

p match_a.index, match_b.index # 71, 125

match_context_a = str_a[match_a.index - 10 .. match_a.index + 20]
match_context_b = str_b[match_b.index - 10 .. match_b.index + 20]

# Match from Stanza 1: "e wabe;\nThis is copied.\nAll mim"
puts "Match from #{match_a.source}: #{match_context_a.inspect}"

# Match from Stanza 2: "ious -- This is copied. -- Band"
puts "Match from #{match_b.source}: #{match_context_b.inspect}"

You may find that Matcher doesn't handle your exact use-case. That's not a problem. The built-in matcher.rb file is only about 10 lines of code, so you could easily make your own.

๐Ÿ’ฅ ๐Ÿ’ฃ A major caveat with String#hash ๐Ÿ’ฃ ๐Ÿ’ฅ

In order to avoid algorithmic complexity attacks, the value returned from Ruby's String#hash method changes every time you restart the interpreter:

$ irb
2.0.0p353 :001 > "hello".hash
 => 482951767139383391
2.0.0p353 :002 > exit

$ irb
2.0.0p353 :001 > "hello".hash
 => 3216751850140847920
2.0.0p353 :002 > exit

(This is the case even if you're using JRuby.)

This means that although the winnowing algorithm should allow you to precalculate a document's fingerprints and store them somewhere, doing so in Ruby will not work unless you're careful to make sure you never restart your Ruby runtime.

A workaround

Winnow looks for the presence of a String#consistent_hash method. If it finds one, it'll call that rather than call String#hash. You can therefore describe your own hash function if you want to precalculate fingerprint data.

I've put together a super-simple (but effective) gem called consistent_hash that implements exactly this. It's about a dozen lines of MRI C code and it'll probably work for you as well.

winnow-ruby's People

Contributors

soryu23 avatar ucarion avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.