YelpReviewPrediction

By Terry Shih and InHo Rha

Data science project for CSCI 183. ~~Using the data from Yelp reviews to predict the stars it will give.~~ Changed project to topic modeling and extraction of categorical scores for the restaurants based on the extracted topics.

Steps to run our code:

Using: Python 2 and Jupyter Notebook Required packages: numpy, pandas, sklearn, matplotlib, gensim, nltk, pytagcloud (optional tool for creating word clouds from https://github.com/atizo/PyTagCloud)

Setup:

Clone https://github.com/edwardrha/YelpReviewPrediction
Download Yelp data from https://www.yelp.com/dataset/challenge and place the JSON files into the /dataset directory.
Run the preprocessing.py from inside the /src directory. This will create and save the processed restaurant review files into the /dataset directory. WARNING: This step requires very high amount of RAM(64GB or higher). We do not recommend this step to be run and instead use the processed JSON uploaded to Google Drive here: https://drive.google.com/drive/folders/1gYgWxNDK_78notWqKFuRiujawdkB640r?usp=sharing
Run the ClusterModel.py from inside the /src directory. This will create a CountVectorizer object, feature names object, train 6 LDA models for k=[10, 15, 20, 25, 30, 40] and pickle them into /models directory. It will also save the label predictions for the reviews into /dataset directory as txt files. NOTE: Training each LDA model takes around 30 minutes each per core. Time consuming.
Now the Main.ipynb in the main directory is ready to be run.

Main.ipynb: By using the models and data created from the Setup process, demonstrates how we can predict the category of a review and use it to give the categorical rating for a chosen restaurant.

Labeling.ipynb: Contains the codes we used to examine the reviews from a cluster so we can manually label them.

Slides.ipynb: Contains the codes and slides used to create our presentation. Run in terminal: jupyter nbconvert Slides.ipynb --to slides --post serve To open the presentation slides.

edwardrha / yelpreviewprediction Goto Github PK