LexRank Summarizer

This is a Spark-based extractive summarizer built on the LexRank algorithm. By default, it extracts a five-sentence summary from each document in the corpus.

Boilerplate sentences are detected in two ways, both based on sign-random-projection locality-sensitive hashing (SRP-LSH) signatures. Across the corpus, sentences whose signatures occur frequently are treated as boilerplate. Within a document, boilerplate is detected using cosine similarity estimated from the sentences' SRP-LSH signatures.
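To make the signature idea concrete, here is a minimal Python sketch of generic SRP-LSH, not the project's Scala code and without the pooling trick mentioned below. Each signature bit records which side of a random hyperplane a sentence vector falls on, and the fraction of differing bits between two signatures estimates the angle between the vectors. The function names, the dense-list vector representation, and the seed parameter are all assumptions made for illustration.

```python
import math
import random


def random_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian vectors, one per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def srp_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of differing bits."""
    differing = sum(a != b for a, b in zip(sig_a, sig_b))
    # The expected fraction of differing bits equals angle / pi.
    angle = math.pi * differing / len(sig_a)
    return math.cos(angle)


# Hypothetical usage: two 2-dimensional vectors at a 45 degree angle.
planes = random_hyperplanes(num_bits=2048, dim=2)
sig_x = srp_signature([1.0, 0.0], planes)
sig_y = srp_signature([1.0, 1.0], planes)
approx = estimated_cosine(sig_x, sig_y)  # close to cos(45 degrees)
```

Because only the sign of each projection is kept, identical signatures can be counted cheaply across a corpus, and pairs of signatures can be compared without revisiting the original vectors. Longer signatures give tighter similarity estimates at the cost of more storage.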

For an explanation of the pooling trick used in this SRP-LSH implementation, see "Online Generation of Locality Sensitive Hash Signatures".

Usage

Build a JAR file from the source with sbt assembly. Submit a job to Spark with:

spark-submit --class io.github.karlhigley.lexrank.Driver <path to jar file> [options]

Options:
-i PATH,  --input PATH         Relative path of input files (default: "./input")
-o PATH,  --output PATH        Relative path of output files (default: "./output")
-s VALUE, --stopwords VALUE    Number of stopwords to remove (default: 250)
-l VALUE, --length VALUE       Number of sentences to extract from each document (default: 5) 
-b VALUE, --boilerplate VALUE  Similarity cutoff for cross-document boilerplate filtering (default: 0.8)
-t VALUE, --threshold VALUE    Similarity threshold for LexRank graph construction (default: 0.1)
-c VALUE, --convergence VALUE  Convergence tolerance for PageRank graph computation (default: 0.001)
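
The length, threshold, and convergence options correspond to the usual steps of LexRank: link sentences whose similarity exceeds a threshold, run PageRank on the resulting graph until the scores stop changing, and keep the top-ranked sentences. The following Python sketch illustrates those steps on a single machine. It is not the project's Spark implementation, and several details are assumptions: whitespace tokenization, raw term counts rather than weighted terms, a damping factor of 0.85, and returning sentences in their original order.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank(sentences, length=5, threshold=0.1, convergence=0.001, damping=0.85):
    """Return the `length` highest-ranked sentences, in original order."""
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(s.lower().split()) for s in sentences]
    # Link each sentence to the others whose similarity exceeds the threshold.
    neighbors = [
        [j for j in range(n) if j != i and cosine(bags[i], bags[j]) > threshold]
        for i in range(n)
    ]
    scores = [1.0 / n] * n
    while True:
        updated = [(1.0 - damping) / n] * n
        for i, adjacent in enumerate(neighbors):
            # Sentences with no links spread their score over all sentences.
            targets = adjacent if adjacent else range(n)
            share = damping * scores[i] / len(targets)
            for j in targets:
                updated[j] += share
        delta = sum(abs(u - s) for u, s in zip(updated, scores))
        scores = updated
        # Stop once the total change in scores falls below the tolerance.
        if delta < convergence:
            break
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:length]
    return [sentences[i] for i in sorted(top)]
```

Raising the threshold makes the graph sparser, so fewer sentences vouch for each other. Tightening the convergence tolerance costs more iterations in exchange for more stable rankings.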

File Formats

The summarizer expects tab-separated text files with each document on a single line. Each line should contain a document identifier in the first column and the document text in the second column.

Outputs are formatted similarly, but with the text of a single sentence in the second column.
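
As an illustration of the two layouts, the following Python sketch parses one input line and formats the matching output lines. The helper names are hypothetical, and the assumption that each extracted sentence is written on its own line with the document identifier repeated is inferred from the description above rather than taken from the project's code.

```python
def parse_input_line(line):
    """Split one input line into (document id, document text)."""
    # Split on the first tab only, so tabs inside the text are preserved.
    doc_id, text = line.rstrip("\n").split("\t", 1)
    return doc_id, text


def format_output_lines(doc_id, summary_sentences):
    """Build one output line per extracted sentence, keyed by document id."""
    return [f"{doc_id}\t{sentence}" for sentence in summary_sentences]


# Hypothetical usage with a made-up document.
doc_id, text = parse_input_line("doc-1\tFirst sentence. Second sentence.\n")
lines = format_output_lines(doc_id, ["First sentence.", "Second sentence."])
```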
