gutenberg's Introduction

Gutenberg

A Content Based Recommender using Project Gutenberg's entire database of books.

This system models literary style and taste using

  • Punctuation profiles
  • Part-of-speech tagging
  • TF-IDF clustering
  • Sentiment analysis
  • Other linguistic features

Sample Recommendations

[image: sample recommendations]


How to Use It

  1. Download the data
  2. Run knn.py. The command prompt will ask you for the name of a book you enjoy.
  3. You're done! Let us know if the recommendations fit your taste.

Steps for Replication

Data Collection

  1. Run the shell script getBooks.sh to start pulling books from Project Gutenberg. Unzip the texts, then run strip_headers.py to remove the legal boilerplate in the headers and footers of all Project Gutenberg texts. This will take a while to run, and the entire text corpus (roughly 20 GB in total) may not be necessary. As an alternative, see the University of Michigan text corpus of 3,000 books here
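The repo's strip_headers.py isn't reproduced on this page; a minimal sketch of the idea, assuming the standard Project Gutenberg "*** START OF ..." / "*** END OF ..." marker lines (the actual script may handle more edge cases):

```python
# Minimal sketch of Project Gutenberg header/footer stripping
# (an illustration, not the repo's strip_headers.py). Assumes the
# standard "*** START OF ..." / "*** END OF ..." marker lines.
def strip_headers(text):
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1          # body begins after the START marker
        elif line.startswith("*** END OF"):
            end = i                # body ends before the END marker
            break
    return "\n".join(lines[start:end]).strip()
```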

Feature Extraction

  1. Run in terminal: preprocessing.py. This creates an output file with '|'-delimited features: it runs sentiment analysis, takes punctuation and part-of-speech profiles, and extracts other basic features.
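To illustrate the kind of row preprocessing.py emits, here is a hedged sketch of a punctuation profile plus a simple count, joined with the '|' delimiter (the real script also runs sentiment analysis and part-of-speech tagging; the exact fields are an assumption):

```python
import string
from collections import Counter

# Illustrative sketch of simple '|'-delimited feature extraction;
# field choices here are hypothetical, not preprocessing.py's actual schema.
def basic_features(book_id, text):
    counts = Counter(c for c in text if c in string.punctuation)
    n_chars = max(len(text), 1)
    fields = [book_id,
              f"{counts[','] / n_chars:.5f}",   # comma rate
              f"{counts[';'] / n_chars:.5f}",   # semicolon rate
              f"{counts['!'] / n_chars:.5f}",   # exclamation rate
              str(len(text.split()))]           # word count
    return "|".join(fields)                     # '|'-delimited row
```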

Run TFIDF clustering

  1. If the data set is small enough to run locally, run in terminal:
/[usr dir]/spark-submit /[usr dir]/TFIDF_Kmeans.py "/[usr dir]/[data set input]/" "/[usr dir]/[output dir]"

If using a larger text corpus, we recommend uploading the books to Amazon EC2 and running this script on Elastic MapReduce.

On AWS, this job outputs one file of cluster memberships for each slave node. The format is (BOOKID.txt, clusterID).
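The Spark job itself (TFIDF_Kmeans.py) isn't shown here; as a rough local illustration of the TF-IDF scoring it performs, a pure-Python sketch (not the actual Spark implementation):

```python
import math
from collections import Counter

# Rough pure-Python illustration of TF-IDF scoring; the real pipeline
# computes this in Spark (TFIDF_Kmeans.py) and then clusters with k-means.
def tfidf(docs):
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (tf[w] / len(doc)) * math.log(n / df[w])
                       for w in tf})    # term frequency * inverse doc freq
    return scores
```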

Form Recommendations

  1. Run knn.py
  • Enter the book title and author name exactly as they appear in the data (it will ask you to try again if not)
  • See the 15 nearest neighbors to the book in question
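The nearest-neighbor step can be sketched as follows (an illustration of the idea, not the repo's knn.py; it ranks books by Euclidean distance to the query's feature vector and returns the closest k):

```python
import math

# Sketch of a k-nearest-neighbors lookup over book feature vectors;
# the README's recommender returns the 15 nearest, hence k=15.
def nearest_neighbors(query, catalog, k=15):
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, v)))
    ranked = sorted(catalog.items(), key=lambda kv: dist(kv[1]))
    return [title for title, _ in ranked[:k]]
```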

Next Steps / In Progress:

  1. TFIDF-based search functionality (word_search.py)
  2. Expanded feature extraction (adding additional features to preprocessing.py)
  3. Mongodb storage for sparse TFIDF data (populate_db.py)
  4. A web app to make this easier to interact with

Team:
James LeDoux, Drew Hoo, Aniket Saoji

Blog post: http://jamesrledoux.com


gutenberg's Issues

KMeans output

Add titles to the k-means output: match each book title to its book ID so the CSV format is (title, id, cluster). This would make the results easier to interpret.

KMeans Scaling

Data needs to be scaled for k-means. Word scores should be bounded in [0, 1].
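A minimal sketch of the scaling this issue asks for, assuming per-feature min-max scaling into [0, 1] (one reasonable choice; the issue doesn't specify the method):

```python
# Sketch of min-max scaling a single feature column into [0, 1]
# before k-means, as the issue suggests.
def minmax_scale(column):
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)   # constant column: no spread to scale
    return [(x - lo) / (hi - lo) for x in column]
```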

Replication problem

When I tried to replicate step 3 (the TF-IDF code), it runs fine, but I can't find the expected data in the output folder created by the PySpark code.

All I can see in the output folder is:
(u'file:/home/system5/Documents/gutenberg-master/output_POS.txt', 0)

TO DO

Fix normalizing:

  • normalize on dense matrix, not sparse
  • Stem before we normalize

XML parsing:

  • To get more features
  • year
  • author
  • genre

Finish Streaming:

  • Include necessary files from Gutenberg repo
  • edit tfidf_streaming.py to include streaming

Format Outputs:

  • run current tfidf.py locally to get output
  • use ALS from movieLensALS to cluster

Spring Cleaning

Some of these files might not be necessary anymore. Need to look closer and verify, but I think fred.txt is an out of date example output, featurizer is now baked into MoreEfficientPreprocessing, and get_sentiment.py [the first one] is inferior to the second.

Comma and backslash issue in book names

The TF-IDF script splits book titles on commas, which causes it to break. The earlier quick fix was to remove commas from book titles, but that won't work once we start taking titles from online. The backslash character also causes breakage.

e.g. "Lincoln's Gettysburg address, delivered in 1863" (one title) => "Lincoln's Gettysburg address" and "delivered in 1863" as two separate titles, where neither of these partial titles exists
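One robust fix (an assumption, not the repo's current approach) is to write and parse rows with Python's csv module, which quotes fields so commas and backslashes inside titles survive the round trip:

```python
import csv
import io

# Quoted CSV round-trip: commas and backslashes inside titles are
# preserved, unlike a naive str.split(",").
def write_rows(rows):
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerows(rows)
    return buf.getvalue()

def read_rows(text):
    return list(csv.reader(io.StringIO(text)))
```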

Cosine similarity

See whether we can use cosine similarity instead of Euclidean distance for the clustering. Supposedly that can work better for text.
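For reference, a minimal sketch of the measure this issue proposes; unlike Euclidean distance, cosine similarity ignores vector magnitude, which often suits sparse text features:

```python
import math

# Cosine similarity between two feature vectors:
# dot(u, v) / (||u|| * ||v||), in [-1, 1] for real-valued vectors.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```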

Authors, Years

Get the author. Get a guess at the year of publication if possible (I believe the author's lifetime will be included in the metadata we pull from the website; the mean of this will be a good enough guess).
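The guess described above amounts to taking the midpoint of the author's lifetime, e.g.:

```python
# Sketch of the publication-year guess described in the issue:
# the midpoint of the author's birth and death years from the metadata.
def guess_publication_year(birth_year, death_year):
    return (birth_year + death_year) // 2
```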

How can I get data csv files?

When I run the TF-IDF clustering code on the preprocessed folder, does it give me the exact CSV file with all the columns and rows (cluster size values: small, medium, large, etc.), just like the one in the repo's data folder?

Drop low tfidf scores

Insert something into tfidf that cuts everything below a certain threshold. Simplifies the k-means step and removes some of the noise from our data.
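A minimal sketch of the cutoff this issue describes (the threshold value is an arbitrary illustration, not one the repo has chosen):

```python
# Sketch of dropping low TF-IDF scores: keep only terms whose score
# meets a threshold (0.01 here is an arbitrary illustrative value).
def drop_low_scores(scores, threshold=0.01):
    return {term: s for term, s in scores.items() if s >= threshold}
```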

Where to get all-data.csv file?

I want to execute the kNN code directly, but I can't find the all-data.csv file in the repo. Please share the all-data file.

Create Sample Set

Create a sample set of known books to test the clustering. Pick clusters of books that you know are similar, then run the code on them to test efficacy.

"bar" character in get_sentiment output

Having trouble with the '|' character in the output of the sentiment analyzer: pandas reads it as a delimiter, since output.txt is bar-delimited. Could we go without including these characters? They seem unimportant in 99.999% of literature.
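One simple fix (an assumption, not necessarily what the repo adopted) is to strip the delimiter character from any text before writing a '|'-delimited row:

```python
# Sketch: remove the delimiter character from a field before writing
# a '|'-delimited row, so pandas doesn't split the field mid-value.
def sanitize_field(text, delimiter="|"):
    return text.replace(delimiter, " ")
```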

"Sufficiently Large" IDF dataset

TF-IDF is time-expensive, so let's find a way to make this more efficient. All frequencies are kept local at the end of the first pass, then the score is calculated, so it requires multiple passes over the book data. Find a data set large enough that we can assume the IDF values, since word frequency across all 50,000 books might not actually affect things greatly. The challenge is to find a set representative of the whole.

NOTE: potentially just use full Michigan Dataset for IDF then run TFIDF on full Gutenberg using Michigan IDF
