Coder Social home page Coder Social logo

salimandre / n-gram-language-models Goto Github PK

View Code? Open in Web Editor NEW
1.0 1.0 0.0 425 KB

language modeling, bigram, trigram, mixture of language models, perplexity, Gutenberg project

Python 100.00%
gutenberg-project nlp language-model bigram trigram ngrams mixture-model perplexity machine-learning

n-gram-language-models's Introduction

N-gram models for language modeling

We simply built n-gram models which are simple baseline language models. More precisely we built bigram and trigram models. Raw n-gram models cannot be evaluated on a test set as they overfit training text data too much. Hence it is required to use a smoothing technique (regularization). We chose an intuitive one by computing the convex mixture between an ngram model and the uniform distribution over vocabulary as the convex mixture of probability distributions remain a probability distribution.

As for dataset we simply used the book Pride and Prejudice of Jane Austen which appears to be the most downloaded ebooks from Gutenberg project.

We then performed 2 tasks:

  • language model evaluation by computing perplexity on test set
  • text generation using raw bigram model

Dataset

Data: ebook Pride and Prejudice by Jane Austen

Training: 60 first chapters roughly 125 000 tokens

Test: last chapter roughly 1250 tokens

Vocabulary

We considered the 1000 most frequent words to buid our vocabulary. Those which were not among them were considered as being out of vocabulary and replaced by token in data set. Hence we had 13.1% of OOV words among whole data set.

Without surprise we can observe Zipf's law on the plot of word frequencies.

Preprocessing

Raw data sample:

      “How so? How can it affect them?”

      “My dear Mr. Bennet,” replied his wife, “how can you be so
      tiresome! You must know that I am thinking of his marrying one of
      them.”

We only performed basic NLP preprocessing as we want to model language properly.

Training data sample:

<S> how so how can it <OOV> them </S> <S> my dear mr bennet replied his wife how can you be so <OOV> you must know that i am thinking of his marrying one of them </S>

Models

Here is the list of models we designed:

Evaluation

We used perplexity which is the usual metric to evaluate language models, whose formula is the following:

Model Mixture parameters Perplexity on train set Perplexity on test set
Smoothed bigram lambda = 0.2 44 91
Smoothed trigram lambda = 0.5 14 180
bigram/trigram mixture lambda = 0.2, alpha = 0.8 17 84

Text generation

sample of generated text:

<S> i cannot bear to speak with the <OOV> must be <OOV> by <OOV> for myself with grateful respect towards anybody connected with his sister and with more <OOV> than i <OOV> said in a tone which had always <OOV> about him </S> <S> </S> <S> oh my dear father i <OOV> other <OOV> elizabeth hoped it might <OOV> <OOV> you may suppose of my <OOV> till <OOV> put an end to by any <OOV> appearance of composure elizabeth merely <OOV> but though not in her little <OOV> <OOV> and the evening would <OOV> have been a <OOV> of

sample of generated text:

<S> upon my word sir cried elizabeth </S> <S> oh dear yes but <OOV> imagine it was not able to <OOV> that his character for it however at her entrance she was reading a <OOV> excuse for not having done thus much there is no <OOV> to be <OOV> near the <OOV> that you <OOV> were removed and the <OOV> of some one else elizabeth at that moment i may <OOV> his mind was not <OOV> a kind to make his <OOV> to charlotte’s first <OOV> is ever to see many young men too that she had

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.