
cs220-plagiarism-detector's Introduction

Plagiarism Detector

This assignment is based on this "nifty" assignment.

Code

The code is on GitHub here. Fork the repo!

There is an API (basically, an interface) for you to implement called IPlagiarismDetector. Your code goes in PlagiarismDetector.

Documents

There are 1,800 documents, ranging from a few dozen words to over 1,000 words, with an average length of about 418 words.

Some notes on the data format:

  • The documents are split into sentences, with one sentence per line.
  • Treat punctuation as a separate word. This is a common approach when processing text.
  • Process the sentences one at a time. This means that n-grams should not "span" a line. So once you get to the end of a line of text, stop processing, read the next line, and start again.
    • An easy way to do this is with a Scanner and a while loop using the hasNextLine() and nextLine() methods.
    • Convert each line you've read with the Scanner into String[] like this: line.split(" ") and then you can write for loops to go through the array.
  • If any line has fewer than N words (for example, a sentence with 2 words when N is 3), just skip it, since it cannot contain a complete n-gram. Real data is messy and we have to make decisions like this all the time; it will be fine.
  • The data files are taken from a Kaggle contest, which is why they are restricted to only Knox students, as I'm not supposed to publicly redistribute them.
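The line-by-line processing described above can be sketched roughly as follows. This is a minimal illustration, not the starter code: the class name NgramSketch and both method names are hypothetical, and it assumes N is at least 1 and that words (including punctuation) are separated by single spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class; not part of the assignment's starter code.
class NgramSketch {

    // Returns the n-grams of one sentence (one line of input).
    // Returns an empty list when the line has fewer than n words.
    // Assumes n >= 1.
    static List<String> ngramsOfLine(String line, int n) {
        List<String> result = new ArrayList<>();
        if (line.isEmpty()) {
            return result; // nothing to process on a blank line
        }
        String[] words = line.split(" ");
        if (words.length < n) {
            return result; // too short for even one n-gram: skip it
        }
        for (int i = 0; i + n <= words.length; i++) {
            // Join words[i] .. words[i + n - 1] with single spaces.
            StringBuilder sb = new StringBuilder(words[i]);
            for (int j = 1; j < n; j++) {
                sb.append(' ').append(words[i + j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    // Reads every line from the Scanner and collects the n-grams of each
    // line separately, so no n-gram spans a line break.
    static List<String> ngramsOfText(Scanner scan, int n) {
        List<String> all = new ArrayList<>();
        while (scan.hasNextLine()) {
            all.addAll(ngramsOfLine(scan.nextLine(), n));
        }
        return all;
    }
}
```

Because each line is handled on its own, a two-line input such as "a b c" followed by "d e" with N = 2 yields "a b", "b c", and "d e", but never "c d". The Scanner could wrap a file rather than a String; how the n-grams are stored afterwards (a list, a set, a map) is a separate design decision.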

The documents for testing are NOT stored in the GitHub repo, because it is not good practice to store lots of data on GitHub. Instead, find the documents on Google Drive here.

Download the zipfile, and then extract or unzip it into your Eclipse project. It should look like this:

[Image: Eclipse project file structure]

Your Eclipse folder structure must look exactly like this. If you have an extra folder, so it's "docs/docs/tinydocs" instead of "docs/tinydocs", then you have to fix it by moving folders around. (This often happens when extracting on Windows.) If your extracted/unzipped documents are not in your Eclipse folder, none of the test cases will be able to find them.

What to submit

  • A link to your GitHub repo
  • A Google Doc that answers the following questions:
    • Which documents, if any, seem like they were plagiarized?
    • What values of N did you try, and what effects did you observe from these different values?
    • What was a good threshold for number of n-grams in common for two documents to seem suspicious?
      • Remember, the documents vary in length, so having a certain number of n-grams in common does not automatically mean there was plagiarism! It just means the documents seem suspicious and require a human to look at them.
    • What kind of runtimes are you looking at to process all 1,800 documents? It takes about a minute on my machine.
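To make the threshold question concrete, here is one possible way to count the n-grams two documents have in common. It is only a sketch under assumptions: the class name CommonNgramSketch and its methods are hypothetical, each document is assumed to already be reduced to a set of distinct n-gram strings, and the threshold is a value you choose by experiment, not one prescribed by the assignment.

```java
import java.util.Set;

// Hypothetical helper class; not part of the assignment's starter code.
class CommonNgramSketch {

    // Counts the distinct n-grams that appear in both sets.
    // Iterates over the smaller set and probes the larger one.
    static int commonCount(Set<String> a, Set<String> b) {
        Set<String> smaller = (a.size() <= b.size()) ? a : b;
        Set<String> larger = (smaller == a) ? b : a;
        int count = 0;
        for (String gram : smaller) {
            if (larger.contains(gram)) {
                count++;
            }
        }
        return count;
    }

    // Flags a pair of documents for human review when the number of
    // shared n-grams reaches the chosen threshold.
    static boolean isSuspicious(Set<String> a, Set<String> b, int threshold) {
        return commonCount(a, b) >= threshold;
    }
}
```

A raw count like this ignores document length, which is why a flagged pair only means "worth a look". With hash-based sets, each comparison costs time roughly proportional to the smaller set, which matters when comparing many pairs of documents.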
