text-matcher

A simple command-line tool for detecting text reuse. Given a pair of texts or directories of texts, it finds similar passages between them. This is useful for detecting text reuse such as citation, quotation, intertextuality, and plagiarism.

The pilot experiment using this tool is allusion-detection, and a newer project using it is middlemarch-critical-histories.

Demo

Does Milton quote from the Bible in his Areopagitica? Let’s find out.

$ text-matcher kjv.txt areopagitica.txt 

1 total matches found.

match 1:
kjv.txt: (4135539, 4135561) Spirit. 5:20 Despise not prophesyings Prove all things; hold fast that which is good. 5:22 Abstain
areopagitica.txt: (25861, 25883) answerable to that of the Apostle to the Thessalonians PROVE ALL THINGS, HOLD FAST THAT WHICH IS GOOD. And he might

Usage

Run text-matcher with the names of the text files you want to compare. You can also provide a directory of files in place of a single file: to compare textA.txt with every text file in textdir/, run text-matcher textA.txt textdir/.
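If you would rather drive the comparison from a Python script, the same command can be launched with the standard library. The following is only a minimal sketch: the helper names `build_command` and `run_text_matcher` are hypothetical and not part of text-matcher, and it assumes the `text-matcher` executable is already installed and on your PATH.

```python
import shutil
import subprocess


def build_command(text1, text2):
    """Return the argument list for one text-matcher run (hypothetical helper)."""
    return ["text-matcher", str(text1), str(text2)]


def run_text_matcher(text1, text2):
    """Run the text-matcher CLI and return whatever it writes to stdout."""
    # Fail early with a clear message if the executable is not installed.
    if shutil.which("text-matcher") is None:
        raise RuntimeError("text-matcher is not on PATH; install it first")
    result = subprocess.run(
        build_command(text1, text2),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output as str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `run_text_matcher("textA.txt", "textdir/")` is then equivalent to typing the command from the paragraph above in a shell.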

You can also tune the matching by setting the n-gram size to match against and the threshold. From the help:

$ text-matcher --help
Usage: text-matcher [OPTIONS] TEXT1 TEXT2

  This program finds similar text in two text files.

Options:
  -t, --threshold INTEGER    The shortest length of match to include in the
                             list of initial matches.
  -c, --cutoff INTEGER       The shortest length of match to include in the
                             final list of extended matches.
  -n, --ngrams INTEGER       The ngram n-value to match against.
  -m, --mindistance INTEGER  The minimum value for distance between two
                             match.
  -l, --logfile TEXT         The name of the log file to write to.
  --stops                    Include stopwords in matching.
  --verbose                  Enable verbose mode, giving more information.
  --help                     Show this message and exit.
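To get a feel for what an n-gram size and a length threshold do, here is a rough, self-contained sketch of n-gram matching in general. It is not text-matcher's actual implementation: the function names, the tokenizer, and the reading of `threshold` as "minimum number of consecutive shared n-grams" are all assumptions made for illustration, and unlike the tool it makes no attempt at stopword handling or at extending matches.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shared_runs(text_a, text_b, n=3, threshold=3):
    """Return passages of text_a whose n-grams also occur, in order, in text_b.

    Assumption for this sketch: `threshold` is the minimum number of
    consecutive shared n-grams needed to report a match.
    """
    tokens_a = tokenize(text_a)
    grams_a = ngrams(tokens_a, n)
    grams_b = ngrams(tokenize(text_b), n)
    matcher = SequenceMatcher(None, grams_a, grams_b, autojunk=False)
    runs = []
    for block in matcher.get_matching_blocks():
        if block.size >= threshold:
            # A run of `size` overlapping n-grams covers size + n - 1 words.
            words = tokens_a[block.a:block.a + block.size + n - 1]
            runs.append(" ".join(words))
    return runs


kjv = ("Despise not prophesyings. Prove all things; hold fast that "
       "which is good. Abstain from all appearance of evil.")
milton = ("answerable to that of the Apostle to the Thessalonians, PROVE ALL "
          "THINGS, HOLD FAST THAT WHICH IS GOOD. And he might")

print(shared_runs(kjv, milton, n=3, threshold=3))
# ['prove all things hold fast that which is good']
```

In this sketch, raising `n` or `threshold` suppresses short, coincidental overlaps at the cost of missing brief quotations. Note that `SequenceMatcher` reports non-crossing matching blocks, so it is a convenient approximation rather than an exhaustive search for every shared run.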

Installation

You can install text-matcher using pip:

pip3 install --user text-matcher

Or globally, with sudo:

sudo pip3 install text-matcher

Alternatively, clone this repo and install locally, using pip:

git clone https://github.com/JonathanReeve/text-matcher
cd text-matcher
pip install .

Or with Pipenv:

git clone https://github.com/JonathanReeve/text-matcher
cd text-matcher
pipenv install .
pipenv run text-matcher

Citation

If you use text-matcher in your research, you can cite it like this, for now:

@misc{Reeve2020,
  author = {Reeve, Jonathan},
  title = {Text-Matcher},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/JonathanReeve/text-matcher}},
  commit = {988d9422a63165225ea136fc31427b1e57814505},
  doi = {10.5281/zenodo.3937738}
}
