
This project forked from cnap/sentence-compression

# Sentence compression

Courtney Napoles

last updated 18 September 2015

ABOUT

This program generates sentence-level compressions via deletion. It is a modified implementation of the ILP model described in Clarke and Lapata, 2008, "Global Inference for Sentence Compression: An Integer Linear Programming Approach".

SETUP

ant compile

ILOG CPLEX must be installed for the program to run. Update the CPLEX paths in build.xml and in the compress script to match your installation.

RUN

	Usage: ./compress -i path/to/input -l path/to/lm [-d] [-t] [-q] [-x]
	  -i val  input file or directory
	  -d      debug
	  -l val  path to language model (binary or arpa)
	  -t      output should be <= 120 characters
	  -q      suppress cplex output (normally goes to stderr)
	  -x      input file(s) in xml format

INPUT

The program expects tokenized text with one sentence per line.

OUTPUT

<orig_len> <short_len> <compression> <orig_indices> <compression_rate>

For example, for the input sentence "At the camp , the rebel troops were welcomed with a banner that read : `` Welcome home . ''", the output is as follows:

20 8 At camp , the troops were welcomed . 1 3 4 5 7 8 9 19 0.4
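The relationship between the fields can be checked mechanically. The following is a minimal Java sketch and is not part of the program: the class `OutputRecord` and its methods are hypothetical names, and it assumes only what the example above shows, namely that the indices are 1-based token positions in the input sentence and that the compression rate is short_len divided by orig_len.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a class from this repository.
public class OutputRecord {

    /** Rebuilds the compression by picking the tokens at the given 1-based indices. */
    public static String rebuild(String sentence, int[] keptIndices) {
        String[] tokens = sentence.trim().split("\\s+");
        List<String> kept = new ArrayList<>();
        for (int index : keptIndices) {
            kept.add(tokens[index - 1]); // indices in the output are 1-based
        }
        return String.join(" ", kept);
    }

    /** Compression rate as suggested by the example: kept tokens / original tokens. */
    public static double compressionRate(int origLen, int shortLen) {
        return (double) shortLen / origLen;
    }

    public static void main(String[] args) {
        String original = "At the camp , the rebel troops were welcomed "
                + "with a banner that read : `` Welcome home . ''";
        int[] indices = {1, 3, 4, 5, 7, 8, 9, 19};
        System.out.println(rebuild(original, indices));          // At camp , the troops were welcomed .
        System.out.println(compressionRate(20, indices.length)); // 0.4
    }
}
```

Running `main` reproduces the compression and the rate from the example line above.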

JAVA CLASS

To generate extractive compressions (by deletion only) using an extended version of Clarke and Lapata's (2008) ILP model:

java research.compression.SentenceCompressor
   Required arguments:
     -in=val        path to the input file or directory
     -lm=val        path to the language model (trigram)
   Optional arguments:
     -char          use character-based constraints
     -cr=val        minimum compression rate (default is 0.4)
     -debug         debug
     -l=val         specify lambda value (tradeoff between n-gram probability and
                    "significance" score in the objective function)
     -ngram         use the n-gram constraint (each n-gram in the compression must be
                    present in Google n-grams; the n-gram server must be running)
     -quiet         suppress cplex output
     -target=val    specify the target compression length for each sentence
     -test_lambda   test varying values of lambda (for dev)
     -tweet         use a Twitter length constraint (120 characters)
     -xml           input is in xml format

Example call:

java -Xms2g -Xmx10g -Djava.library.path=$ILOG/bin/x86-64_osx \
   -cp bin:lib/berkeleylm.jar:$ILOG/lib/cplex.jar:lib/stanford-parser.jar \
   research.compression.SentenceCompressor -in=data/sample_text -lm=your_lm.gz

LANGUAGE MODEL

The language model used is not provided, for licensing reasons. This software requires a trigram language model in ARPA format. In our research, we used a language model trained on English Gigaword 5 using SRILM. There are some language models available for download from the following sites. Note that I have not tested or used these models myself.

The LM reader used by this program expects each n-gram line to be in the format log_prob<TAB>ngram<TAB>backoff

If there is no backoff weight, then the format should be log_prob<TAB>ngram
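To make the expected layout concrete, here is a minimal Java sketch of parsing one such line. It is not the LM reader used by this program: the class `ArpaLine` is a hypothetical name, and the probabilities in `main` are made-up values chosen only to show the two layouts described above.

```java
// Hypothetical illustration of the line format; not the program's own LM reader.
public class ArpaLine {
    public final double logProb;
    public final String ngram;
    public final Double backoff; // null when the line has no backoff weight

    private ArpaLine(double logProb, String ngram, Double backoff) {
        this.logProb = logProb;
        this.ngram = ngram;
        this.backoff = backoff;
    }

    /** Parses "log_prob<TAB>ngram" or "log_prob<TAB>ngram<TAB>backoff". */
    public static ArpaLine parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 2 || fields.length > 3) {
            throw new IllegalArgumentException("expected 2 or 3 tab-separated fields: " + line);
        }
        double logProb = Double.parseDouble(fields[0]);
        Double backoff = fields.length == 3 ? Double.valueOf(fields[2]) : null;
        return new ArpaLine(logProb, fields[1], backoff);
    }

    public static void main(String[] args) {
        // Made-up values, for illustration only.
        ArpaLine withBackoff = parse("-2.5\tthe rebel\t-0.3");
        ArpaLine withoutBackoff = parse("-1.7\twere welcomed .");
        System.out.println(withBackoff.ngram + " backoff=" + withBackoff.backoff);
        System.out.println(withoutBackoff.ngram + " backoff=" + withoutBackoff.backoff);
    }
}
```

A line whose fields are separated by spaces instead of tabs arrives as a single field and is rejected here.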

If you get a "String index out of range" error and your LM is in ARPA format, the fields may be separated by spaces instead of tabs, or may have trailing spaces. I have added a script, fix_spacing.pl, to fix this issue. To run it, call

zcat your_lm.gz | perl fix_spacing.pl | gzip > your_fixed_lm.gz
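For readers who want to see what this kind of repair involves, the following Java sketch normalizes a single n-gram entry. It is not a port of fix_spacing.pl, whose contents are not shown here: the class `SpacingFix` and its method are hypothetical, and the sketch assumes the caller knows the n-gram order of the current section (for example from the `\3-grams:` header of an ARPA file).

```java
import java.util.Arrays;

// Hypothetical illustration; not a port of fix_spacing.pl.
public class SpacingFix {

    /**
     * Rewrites a whitespace-separated n-gram entry of the given order as
     * log_prob<TAB>ngram[<TAB>backoff]. Lines that do not look like an entry
     * of that order (headers, blank lines) are returned unchanged.
     */
    public static String normalize(String line, int order) {
        String[] tokens = line.trim().split("\\s+");
        boolean hasBackoff = tokens.length == order + 2;
        if (tokens.length != order + 1 && !hasBackoff) {
            return line; // wrong token count for this section
        }
        try {
            Double.parseDouble(tokens[0]); // first field must be a log probability
        } catch (NumberFormatException e) {
            return line; // e.g. a header such as "ngram 1=5"
        }
        StringBuilder fixed = new StringBuilder(tokens[0]).append('\t');
        fixed.append(String.join(" ", Arrays.copyOfRange(tokens, 1, order + 1)));
        if (hasBackoff) {
            fixed.append('\t').append(tokens[order + 1]);
        }
        return fixed.toString();
    }

    public static void main(String[] args) {
        // Made-up trigram entry with space separators and trailing spaces.
        System.out.println(normalize("-2.5 the rebel troops -0.3  ", 3));
    }
}
```

Trimming before splitting removes the trailing spaces, and the token count decides whether the last field is a backoff weight.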
