Ling 227 Final Project

History remembers great speakers for their idiosyncrasies. From JFK's authenticity and passion to Donald Trump's divisive rhetoric, U.S. presidents are particularly identifiable by their speeches. While nearly any citizen can name the presidents who said "Nothing to fear but fear itself" (Franklin Delano Roosevelt), "Ask not what your country can do for you--ask what you can do for your country" (John F. Kennedy), and "Yes We Can!" (Barack Obama), it remains to be seen whether a language model can do the same. Our project aims to build a language model that determines the most likely speaker of a given input quote. Moreover, our project investigates how the model's ability to tell speakers apart changes when it is trained on three types of data: presidential speeches, part-of-speech tags for presidential speeches, and politicians' tweets.

Dependencies

pip install scipy
pip install numpy
pip install nltk

Running the Bigram Language Model

To parse and clean the tweets, see tweet_parse.py. It scans a directory for .txt files and returns a list of the cleaned, parsed sentences from all the tweets.
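As a rough illustration of what cleaning a tweet could involve, here is a minimal sketch. The function name `clean_tweet` and both regular expressions are hypothetical; tweet_parse.py may apply different rules.

```python
import re

# Hypothetical patterns; the project's tweet_parse.py may clean differently.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text):
    """Drop URLs and @mentions, lowercase the rest, and return word tokens."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Keep runs of letters, digits, and apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", text.lower())
```

Calling `clean_tweet` on each sentence of each tweet would give token lists that a bigram model can count directly.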

To parse and clean the speeches, see tok.py. It scans a directory for .txt files and returns a list of the cleaned, parsed sentences from all the speeches.
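The directory-to-sentences step described above might look roughly like the sketch below. It uses only the standard library and a naive punctuation-based sentence split. The name `load_sentences` is hypothetical, and the project's own scripts may rely on nltk tokenizers instead.

```python
import re
from pathlib import Path


def load_sentences(directory):
    """Return a list of token lists, one per sentence, from every .txt file."""
    sentences = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # Naive sentence split: break after ., !, or ? followed by whitespace.
        for raw in re.split(r"(?<=[.!?])\s+", text):
            tokens = re.findall(r"[a-z0-9']+", raw.lower())
            if tokens:  # skip fragments that contain no word characters
                sentences.append(tokens)
    return sentences
```

Files with other extensions in the same directory are ignored, which matches the description of reading only the .txt files.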

To smooth the bigram model with Good-Turing smoothing, see good_turing.py. It applies Good-Turing smoothing using a logarithmic function.
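One plausible reading of "a logarithmic function" is the log-log linear fit from Gale and Sampson's Simple Good-Turing method, where the number of items seen r times, N_r, is approximated by a line in log space. The sketch below works under that assumption and is simplified: it skips the averaging and renormalization steps of the full method, and it is not taken from good_turing.py. The function names are hypothetical.

```python
import math
from collections import Counter


def good_turing_adjusted_counts(counts):
    """Map each observed count r to r* = (r + 1) * S(r + 1) / S(r).

    S(r) = exp(a + b * log r) is a least-squares line fitted to the points
    (log r, log N_r), where N_r is the number of items seen exactly r times.
    """
    freq_of_freq = Counter(counts.values())  # r -> N_r
    xs = [math.log(r) for r in freq_of_freq]
    ys = [math.log(n) for n in freq_of_freq.values()]
    if len(xs) < 2:
        raise ValueError("need at least two distinct counts to fit a line")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    a = mean_y - b * mean_x

    def smoothed(r):
        return math.exp(a + b * math.log(r))

    return {r: (r + 1) * smoothed(r + 1) / smoothed(r) for r in freq_of_freq}


def unseen_mass(counts):
    """Probability mass reserved for unseen events: N_1 / N."""
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total
```

Here `counts` would be a dictionary from bigrams to raw counts. The adjusted counts replace the raw ones when computing probabilities, and `unseen_mass` gives the share set aside for bigrams never observed in training.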

To train the model and compute sentence probabilities using tweets:

python ngram.py -good_turing -tweet twitter/TWEET.txt 2 tweet_quotes.txt

To train the model and compute sentence probabilities using speeches:

python ngram.py -good_turing -speech DIR 2 presidential_quotes.txt
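To make the idea behind these commands concrete, here is a minimal sketch of scoring a quote under one bigram model per speaker and picking the highest-scoring speaker. It is not the code in ngram.py: the class and function names are hypothetical, and it uses add-one smoothing as a simpler stand-in for Good-Turing.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram model with add-one smoothing (a stand-in for Good-Turing)."""

    def __init__(self, sentences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = {"</s>"}
        for tokens in sentences:
            padded = ["<s>"] + list(tokens) + ["</s>"]
            self.vocab.update(tokens)
            self.context_counts.update(padded[:-1])
            self.bigram_counts.update(zip(padded, padded[1:]))

    def log_prob(self, tokens):
        """Return the natural-log probability of a tokenized sentence."""
        padded = ["<s>"] + list(tokens) + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot for unknown words
        total = 0.0
        for prev, word in zip(padded, padded[1:]):
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + v
            total += math.log(numerator / denominator)
        return total


def most_likely_speaker(models, tokens):
    """Pick the speaker whose model assigns the quote the highest log-probability."""
    return max(models, key=lambda name: models[name].log_prob(tokens))
```

With a dictionary that maps each speaker's name to a `BigramModel` trained on that speaker's sentences, `most_likely_speaker(models, tokens)` returns the name with the best score for the tokenized quote.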

To use the Hidden Markov Model on the tweets/speeches:

python ngram.py -good_turing -hmm DIR 2 hmm_quotes.txt

Results

For the tweets: [figure]

For the speeches: [figure]

Authors

See also the list of contributors who participated in this project.

Works Cited

  • Gale, W. and G. Sampson. Good-Turing Frequency Estimation Without Tears. Journal of Quantitative Linguistics, vol. 2, 217-237, 1995.
  • Japi, A. Mimicking Writing Style With Markov Chains. The Sopranos, Silicon Valley, and Summer Afternoons. http://aakashjapi.com/mimicking-writing-style-with-markov-chains/
  • Mosteller, F. and D. L. Wallace. Inference and Disputed Authorship: The Federalist. Reading, MA., 1964.
  • Zheng, R., Li, J., Chen, H. and Huang, Z. A framework for authorship identification of online messages: Writing-style features and classification techniques. J. Am. Soc. Inf. Sci., 57: 378-393, 2006.

Contributors

ketruong, lbenz730, wlanghorne
