aspect_extraction.ipynb: Extracts aspects and plots the relation between the number of aspects and the helpfulness score.
Detect_Plagiarism_smaller_data.ipynb: Plagiarism detection and text quality measurement for data sets that fit completely into memory. For larger data sets, intermediate results must be stored to hard disk at some checkpoints (as done in Detect_Plagiarism_large_data.ipynb).
timestamp_helpfulness.ipynb: Analyses the relation between timestamp and number of helpfulness votes for items.
deviation_plot.ipynb: Plots the helpfulness ratio as a function of deviation from the mean review rating. Also plots this for data with different standard deviations.
helpful_ml.ipynb: Machine learning approach to predict helpfulness (TODO)
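The quantity plotted in deviation_plot.ipynb can be sketched in a few lines. This is a minimal illustration, not the notebook's code: it assumes helpfulness is stored as a pair (helpful votes, total votes), uses made-up review records, and the function name helpfulness_by_deviation is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (itemID, star rating, helpful votes, total votes).
# The (helpful, total) layout is an assumption about the helpfulness field.
reviews = [
    ("A", 5, 8, 10),
    ("A", 3, 2, 10),
    ("A", 4, 9, 10),
    ("B", 1, 1, 5),
    ("B", 3, 4, 5),
]

def helpfulness_by_deviation(reviews):
    """Map rounded (rating - item mean rating) to the mean helpfulness ratio."""
    # Collect all star ratings per item to get each item's mean rating.
    ratings = defaultdict(list)
    for item, stars, _, _ in reviews:
        ratings[item].append(stars)
    item_mean = {item: mean(vals) for item, vals in ratings.items()}

    # Bucket each review's helpfulness ratio by its deviation from the mean.
    buckets = defaultdict(list)
    for item, stars, helpful, total in reviews:
        if total == 0:
            continue  # no votes, ratio undefined
        deviation = round(stars - item_mean[item])
        buckets[deviation].append(helpful / total)
    return {dev: mean(vals) for dev, vals in sorted(buckets.items())}

result = helpfulness_by_deviation(reviews)
```

The resulting dictionary can be passed directly to a plotting call, with deviations on the x-axis and mean helpfulness ratio on the y-axis.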
data/data_frames:
[Item_Category].pkl - items with more than 5 helpfulness scores. Contains itemID, reviewerID, helpfulness and star rating.
[Item_Category].time.pkl - same as above but contains timestamp instead of star rating.
plg/data_frames:
[Item_Category].pkl - items with more than 10 helpfulness scores. Contains all fields of the original data set.
plg/data_frames_fp:
[Item_Category].pkl - same as above but contains computed fingerprints instead of the original review text.
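The fingerprinting scheme used for plg/data_frames_fp is not described here. As a rough illustration only, one common approach is to hash overlapping word k-grams (shingles) and compare the resulting sets with Jaccard similarity. The function names, the choice of k=3 and the use of MD5 below are all assumptions, not the repository's method.

```python
import hashlib

def fingerprint(text, k=3):
    """Return a set of hashed word k-grams (assumed scheme, for illustration)."""
    words = text.lower().split()
    # Texts shorter than k words yield an empty fingerprint.
    shingles = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {
        int(hashlib.md5(s.encode("utf-8")).hexdigest()[:16], 16)
        for s in shingles
    }

def jaccard(a, b):
    """Jaccard similarity of two fingerprint sets; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two made-up reviews sharing their first five words.
fp1 = fingerprint("this phone case fits well and looks great")
fp2 = fingerprint("this phone case fits well and feels cheap")
similarity = jaccard(fp1, fp2)
```

Storing such integer sets instead of the raw review text keeps the pickled dataframes smaller while still allowing pairwise comparison of reviews.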
generate_data[*].sh: Processes data in batches and stores dataframes as pickle files. Calls generate_data[*].py.
generate_data[*].py: Processes the data to extract what is required for each batch.
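The batch-and-pickle pattern described for the generate_data scripts can be sketched as follows. This is a simplified stand-in using plain lists of dicts rather than dataframes; the function names, the batch_N.pkl file naming and the sample records are hypothetical.

```python
import pickle
from itertools import islice
from pathlib import Path

def batches(records, size):
    """Yield consecutive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def process_in_batches(records, out_dir, size, keep_fields):
    """Keep only `keep_fields` from each record and pickle one file per batch."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(batches(records, size)):
        rows = [{f: r[f] for f in keep_fields} for r in chunk]
        path = out_dir / f"batch_{i}.pkl"  # hypothetical naming scheme
        with path.open("wb") as fh:
            pickle.dump(rows, fh)
        paths.append(path)
    return paths

def load_batch(path):
    """Read one pickled batch back into memory."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)
```

Calling process_in_batches(records, "out", 1000, ["itemID", "reviewerID", "helpfulness"]) would write one pickle per 1000 records, so that only a single batch needs to be held in memory at a time.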
https://drive.google.com/open?id=1Ho8UWx7---IsOklX08aljAuExqqG3joJ