
langModel

Simple text prediction demonstration using R

Mark Rodighiero

April 26, 2015

Overview

This project illustrates a 'next word' text prediction technique based on a simple statistical model of language. In this model, the predicted 'next word' is the word with the highest probability of occurrence conditioned on the words that precede it. Three source texts were used to estimate the conditional probabilities and to evaluate the performance of the prediction algorithm. The same sources were used to build the final demonstration, implemented as a publicly accessible web page at data-dancer.com/langModel.

What it does

This algorithm predicts the most likely word to immediately follow an input word or phrase. More precisely, a set of up to 10 candidate words is returned, in order of decreasing likelihood, as predicted by each of the source texts individually and by all source texts combined. The estimated probability of occurrence of each word in the set is shown graphically.

Source texts and data cleaning

The source texts comprise excerpts from on-line news sources, blogs, and Twitter posts. These three sources vary in vocabulary, writing style, and word statistics. The source texts were processed as follows:

  • Common contractions and abbreviations were expanded into complete words.

  • Money amounts, dates, URLs, e-mail addresses, and other numbers are predictive only in a generic sense, so each of these elements was replaced with a corresponding placeholder tag.

  • Because the words at the end of one sentence are not good predictors of the first words of the next sentence, the source text files were reformatted as one sentence per line.
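As an illustration only, the following R sketch shows how such a cleaning pass might be written. The helper name, the contraction table, and the placeholder tag spellings are assumptions, not the project's actual code.

```r
# Hypothetical cleaning pass over a character vector of raw text lines.
clean_text <- function(lines) {
  txt <- tolower(lines)

  # Expand a few common contractions (the real table would be much larger).
  txt <- gsub("won't", "will not", txt, fixed = TRUE)
  txt <- gsub("can't", "cannot",   txt, fixed = TRUE)
  txt <- gsub("n't",   " not",     txt, fixed = TRUE)

  # Replace elements that are only generically predictive with placeholder
  # tags (the tag spellings are illustrative; dates would need their own rule).
  txt <- gsub("\\$[0-9][0-9,.]*",          "<money>",  txt)
  txt <- gsub("https?://\\S+|www\\.\\S+",  "<url>",    txt)
  txt <- gsub("\\S+@\\S+\\.[a-z]{2,}",     "<email>",  txt)
  txt <- gsub("[0-9]+",                    "<number>", txt)

  # One sentence per line, so n-grams never span a sentence boundary.
  unlist(strsplit(txt, "(?<=[.!?])\\s+", perl = TRUE))
}
```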

How the algorithm works

The texts were then tokenized by word boundaries into 2-, 3-, and 4-grams on a line-by-line (i.e., sentence-by-sentence) basis. Each n-gram was split into a "prefix" (the first n-1 words) and a "suffix" (the last word). Finally, the relative frequency of occurrence of each suffix for a given prefix was calculated, both within each document and for all three documents combined.
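The base-R sketch below illustrates this counting step under simple assumptions (whitespace tokenization, one table per n-gram order); the function and column names are illustrative and not taken from the project.

```r
# Build a prefix/suffix frequency table for n-grams of order n from a vector
# of cleaned sentences (one sentence per element).
build_ngram_table <- function(sentences, n = 4) {
  rows <- lapply(strsplit(sentences, "\\s+"), function(words) {
    if (length(words) < n) return(NULL)
    idx <- seq_len(length(words) - n + 1)
    data.frame(
      prefix = vapply(idx, function(i) paste(words[i:(i + n - 2)], collapse = " "), ""),
      suffix = words[idx + n - 1],
      stringsAsFactors = FALSE
    )
  })
  tab <- do.call(rbind, rows)
  if (is.null(tab) || nrow(tab) == 0) return(NULL)

  # Count each (prefix, suffix) pair, then convert counts into relative
  # frequencies of each suffix given its prefix.
  counts <- aggregate(list(count = rep(1, nrow(tab))), tab[c("prefix", "suffix")], sum)
  counts$prob <- counts$count / ave(counts$count, counts$prefix, FUN = sum)
  counts[order(counts$prefix, -counts$prob), ]
}
```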

Prediction works as follows:

The last three words of an input phrase are used to index into the 4-gram token prefixes. The corresponding suffixes with the highest frequency of occurrence are returned as candidates. If the last three words do not match any 4-gram prefixes, then the last two words of the input phrase are used to index into the 3-gram token prefixes, and so on.
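A minimal back-off lookup could be sketched as follows, assuming `tables` is a named list of the frequency tables built above, keyed by n-gram order ("2", "3", "4"); this structure and the names are assumptions made for illustration, not the project's actual interface.

```r
# Back-off prediction: try the 4-gram table first, then the 3-gram table,
# then the 2-gram table, returning up to 'max_candidates' suffixes.
predict_next <- function(phrase, tables, max_candidates = 10) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  for (n in 4:2) {
    k <- n - 1                              # prefix length for an n-gram
    if (length(words) < k) next
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- tables[[as.character(n)]]
    hits <- hits[hits$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$prob), ]
      return(head(hits[, c("suffix", "prob")], max_candidates))
    }
  }
  NULL                                      # no prefix matched at any order
}
```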

How to use the product

Using a web browser, navigate to data-dancer.com/langModel. Enter a word or phrase in the text input box in the left side panel. After a few moments, the most likely next word is shown to the right of the text input box. In addition, a set of words appears for each of the source documents, along with a bar graph showing the estimated conditional probability of each predicted word. The label on the horizontal axis shows the final prefix that was used in the prediction.
