huytquoc / text-prediction-r

This project forked from vruizext/text-prediction-r

Text Prediction app using N-Gram models, developed with R & shiny

Home Page: https://bik-tor.shinyapps.io/text-prediction/

R 13.44% HTML 86.56%

text-prediction-r's Introduction

Text Prediction using N-Gram models

The main goal of this project was to build a predictive text model from a text corpus. It was a "didactic" project: the assignment I had to complete for the Capstone Project of the Coursera Data Science Specialization. By the way, I got the maximum score from my fellow students in the peer review (:

The project consisted of several parts:

  1. Getting and cleaning the data
  2. Exploratory analysis
  3. Building the n-gram model
  4. Building the predictive model
  5. Testing and evaluating the model
  6. Building the UI (shiny app)

The shiny app was the final deliverable of the project. It provides an interface to the prediction algorithm that I built.

How to run the code

In order to build the prediction model, all scripts in the 'scripts' folder need to be sourced. Some of them only have to be executed once; others may need to be run several times in order to tune the model and improve the accuracy of the predictions.

  • 0_get data.R: downloads the corpus, unzips the files, and saves the English files in RDS format.

  • 1_sample.R: samples the corpus, taking a configurable percentage for the training, devtest and test sets.

  • 2_preprocess.R: cleans the data to make it ready for tokenization.

  • 3_unigrams.R: gets the unigram distribution of the training corpus in order to build a dictionary of limited size, which is used in the next steps.

  • 4_tetragrams.R: builds the 4-gram model from the training data.

  • 5_mle-model.R: builds a prediction model using MLE probabilities calculated from the 4-gram model.

  • 6_devtest.R: tests the model with the devtest set, trying several configurations of interpolation weights in order to find the one that minimizes the cross-entropy.

  • 7_test.R: tests the best set of coefficients obtained in the previous step on the test data set to validate the model.
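
To make the modelling steps above more concrete, here is a minimal sketch of the idea behind scripts 4 to 7: count n-grams up to order 4, turn the counts into MLE probabilities, combine the orders with interpolation weights, and pick the weights that minimize cross-entropy on a held-out set. It is written in Python purely for illustration and is not the project's R code. All function names, the `<s>` padding token, the toy corpus and the candidate weights are assumptions made for this example, and the dictionary-size limit from script 3 is left out.

```python
import math
from collections import Counter

PAD = "<s>"  # hypothetical sentence-start padding token


def count_ngrams(sentences, max_n=4):
    """grams[k] counts (k+1)-grams; contexts[k] counts their k-word prefixes."""
    grams = [Counter() for _ in range(max_n)]
    contexts = [Counter() for _ in range(max_n)]
    for tokens in sentences:
        padded = [PAD] * (max_n - 1) + list(tokens)
        for i in range(max_n - 1, len(padded)):
            for k in range(max_n):
                gram = tuple(padded[i - k:i + 1])
                grams[k][gram] += 1
                contexts[k][gram[:-1]] += 1
    return grams, contexts


def mle(grams, contexts, k, gram):
    """MLE estimate: count(context + word) / count(context), 0 if unseen."""
    denom = contexts[k][gram[:-1]]
    return grams[k][gram] / denom if denom else 0.0


def interpolated(grams, contexts, weights, history, word):
    """Weighted sum of the MLE estimates of every order (weights[k] -> order k+1)."""
    padded = [PAD] * (len(grams) - 1) + list(history)
    prob = 0.0
    for k, weight in enumerate(weights):
        ctx = tuple(padded[len(padded) - k:])  # last k words of the history
        prob += weight * mle(grams, contexts, k, ctx + (word,))
    return prob


def cross_entropy(grams, contexts, weights, sentences):
    """Average negative log2 probability per word on a held-out set."""
    log_sum, n_words = 0.0, 0
    for tokens in sentences:
        for i, word in enumerate(tokens):
            p = interpolated(grams, contexts, weights, tokens[:i], word)
            log_sum += math.log2(max(p, 1e-12))  # floor avoids log(0)
            n_words += 1
    return -log_sum / n_words


def best_weights(grams, contexts, candidates, devtest):
    """Pick the candidate weight vector with the lowest devtest cross-entropy."""
    return min(candidates,
               key=lambda w: cross_entropy(grams, contexts, w, devtest))


def predict(grams, contexts, weights, history, top=3):
    """Rank every known word by its interpolated probability after `history`."""
    vocab = [gram[0] for gram in grams[0]]
    ranked = sorted(
        vocab,
        key=lambda w: interpolated(grams, contexts, weights, history, w),
        reverse=True)
    return ranked[:top]


# Toy data standing in for the training and devtest samples.
corpus = [["i", "like", "green", "tea"],
          ["i", "like", "green", "apples"],
          ["i", "like", "black", "tea"]]
devtest = [["i", "like", "green", "tea"]]
candidates = [(1.0, 0.0, 0.0, 0.0), (0.1, 0.2, 0.3, 0.4)]

grams, contexts = count_ngrams(corpus)
weights = best_weights(grams, contexts, candidates, devtest)
print(weights, predict(grams, contexts, weights, ["i", "like", "green"], top=2))
```

On this toy corpus the weights that favor longer contexts win the devtest comparison, and "tea" is ranked ahead of "apples" only because its unigram count is higher. In a real run the candidate weights would come from a finer grid, and the final check would repeat `cross_entropy` on the separate test set.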
