
Parallel Latent Factor Model in Recommender Systems

GPU Accelerated Matrix Factorization

GitHub (recommended reading): https://github.com/lx9t01/cuda_SVD

Project Description

The core idea of using matrix factorization in a recommender system is to map the user rating space into an item space via implicit multi-dimensional latent factor matrices, obtained by an SVD-like decomposition, that describe the relationship between users and items. The rating matrix between users and items is normally sparse, but we can train the factor matrices on the ratings we have and then predict ratings for the users who have not yet rated an item, thereby providing solid and useful recommendations.
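Concretely, each user u has a latent factor vector p_u (a row of P) and each item i a vector q_i (a row of Q); the predicted rating is their dot product. A minimal sketch, assuming row-major factor arrays (the names are illustrative, not the project's actual API):

    // Predicted rating for (user u, item i): dot product of their
    // f-dimensional latent factor vectors. P is num_users x f and
    // Q is num_items x f, both stored row-major.
    float predict_rating(const float* P, const float* Q, int u, int i, int f) {
        float r = 0.0f;
        for (int k = 0; k < f; ++k)
            r += P[u * f + k] * Q[i * f + k];
        return r;
    }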

Program Details

This CUDA accelerated matrix factorization project consists of CPU control code and a GPU part. The GPU part has three kernels, which are responsible for parallelizing:

  • Matrix multiplication
  • Computing the RMS error between the model prediction and the true ratings
  • Updating the P and Q matrices (a sketch of this kernel follows the list)

Data are streamed to the GPU in batches.
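As an illustration of the update kernel, a Hogwild-style SGD step might look like the sketch below; this shows the technique, not the repository's exact code (the array names and the one-thread-per-rating mapping are assumptions):

    // Hypothetical P/Q update kernel: one thread per rating in the
    // current batch, grid-stride loop over the batch.
    __global__ void update_kernel(float* P, float* Q,
                                  const int* users, const int* items,
                                  const float* ratings, int batch_size,
                                  int f, float gamma, float lamda) {
        for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
             idx < batch_size;
             idx += gridDim.x * blockDim.x) {
            int u = users[idx], i = items[idx];
            // prediction error e = r_ui - p_u . q_i
            float e = ratings[idx];
            for (int k = 0; k < f; ++k)
                e -= P[u * f + k] * Q[i * f + k];
            // SGD step with L2 regularization; concurrent threads that
            // share a user or item may race, which plain SGD tolerates
            for (int k = 0; k < f; ++k) {
                float pk = P[u * f + k], qk = Q[i * f + k];
                P[u * f + k] = pk + gamma * (e * qk - lamda * pk);
                Q[i * f + k] = qk + gamma * (e * pk - lamda * qk);
            }
        }
    }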

Output

Time statistics for the CPU and GPU based code, and two CSV files, each containing a predicted full matrix of ratings from users to items.
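To illustrate how such a full predicted matrix could be produced and written out with Eigen (the library the CPU code relies on), here is a minimal sketch with assumed variable names:

    // Compute the full predicted rating matrix R = P * Q^T and dump it
    // to a CSV file, one user per row, one item per column.
    #include <Eigen/Dense>
    #include <fstream>

    void write_predictions(const Eigen::MatrixXf& P,  // num_users x f
                           const Eigen::MatrixXf& Q,  // num_items x f
                           const char* path) {
        Eigen::MatrixXf R = P * Q.transpose();        // num_users x num_items
        std::ofstream out(path);
        for (int u = 0; u < R.rows(); ++u)
            for (int i = 0; i < R.cols(); ++i)
                out << R(u, i) << (i + 1 < R.cols() ? "," : "\n");
    }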

Installation

The CPU part of the code relies on the Eigen library being installed; the CUDA libraries for the GPU part should be installed in their default location as well.

Usage

After building the executable with make, run the program by simply typing

./cuda_SVD <path to training datafile>

to run the program in default settings, or type

./cuda_SVD <path to training datafile> <number of users> <number of items> <number of dimensions f>

to set the number of users, items, and latent dimensions manually.
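For example, with the provided training file and the default dimensions from the parameter table below:

    ./cuda_SVD u1.base 943 1682 30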

The training file provided is u1.base; it is a small portion of the Netflix contest data file.

Parameters in the program

Parameter          Value    Comments
BATCH_SIZE         1000     batch size for data streamed to the GPU
PARSE_THROUGH      1        toggle: the CPU trains iteratively (0) or in a single pass (1)
()                 0.02     threshold that determines when the result converges in iterative mode
blocks             64       number of blocks per grid
threadsPerBlock    64       number of threads per block
num_users          943      1-indexed number of users
num_items          1682     1-indexed number of items
num_f              30       dimension of the latent factors
gamma              0.001    step size, applies to both CPU and GPU code
lamda              0.0005   regularization parameter, applies to both CPU and GPU code
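To show how these parameters fit together, here is a hypothetical host-side sketch of one streamed training pass (buffer names and the helper function are assumptions; update_kernel refers to the update sketch above): ratings are copied to the device in BATCH_SIZE chunks on a CUDA stream, and each chunk is processed by a kernel launch of blocks x threadsPerBlock threads.

    #include <algorithm>
    #include <cuda_runtime.h>

    void train_one_pass(float* d_P, float* d_Q,
                        int* d_users, int* d_items, float* d_ratings,
                        const int* h_users, const int* h_items,
                        const float* h_ratings, int num_ratings,
                        int num_f, float gamma, float lamda) {
        const int BATCH_SIZE = 1000, blocks = 64, threadsPerBlock = 64;
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        for (int off = 0; off < num_ratings; off += BATCH_SIZE) {
            int n = std::min(BATCH_SIZE, num_ratings - off);
            // stage the next batch of (user, item, rating) triplets
            cudaMemcpyAsync(d_users, h_users + off, n * sizeof(int),
                            cudaMemcpyHostToDevice, stream);
            cudaMemcpyAsync(d_items, h_items + off, n * sizeof(int),
                            cudaMemcpyHostToDevice, stream);
            cudaMemcpyAsync(d_ratings, h_ratings + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, stream);
            update_kernel<<<blocks, threadsPerBlock, 0, stream>>>(
                d_P, d_Q, d_users, d_items, d_ratings, n, num_f, gamma, lamda);
        }
        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    }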

History

ORIGINAL METHOD: The project was initially designed to handle a medium amount of data, training on it iteratively until the model converged.

However, this method works for the CPU code but not for the GPU code. As training goes on, we want to shuffle the training data each epoch to mimic random sampling of the data. Doing so in GPU memory would cause unexpected problems, so the iterative training process only applies to the CPU run; copying the data back to the CPU, shuffling, and copying it to the device again would greatly undermine performance.
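A rough sketch of this CPU-side iterative scheme, using the shuffle and the convergence threshold from the parameter table (all names here are assumptions, not the repository's exact code):

    #include <algorithm>
    #include <cmath>
    #include <random>
    #include <vector>

    struct Rating { int u, i; float r; };

    // Shuffle the ratings each epoch, run one SGD pass, and stop once the
    // change in RMS error drops below the threshold (0.02 by default).
    void train_iterative(std::vector<Rating>& data,
                         std::vector<float>& P, std::vector<float>& Q,
                         int f, float gamma, float lamda, float threshold) {
        std::mt19937 rng(42);
        float prev_rms = 1e9f;
        for (;;) {
            std::shuffle(data.begin(), data.end(), rng);  // mimic random sampling
            double sq = 0.0;
            for (const Rating& x : data) {
                float e = x.r;
                for (int k = 0; k < f; ++k)
                    e -= P[x.u * f + k] * Q[x.i * f + k];
                for (int k = 0; k < f; ++k) {
                    float pk = P[x.u * f + k], qk = Q[x.i * f + k];
                    P[x.u * f + k] = pk + gamma * (e * qk - lamda * pk);
                    Q[x.i * f + k] = qk + gamma * (e * pk - lamda * qk);
                }
                sq += e * e;
            }
            float rms = std::sqrt(sq / data.size());
            if (std::fabs(prev_rms - rms) < threshold) break;  // converged
            prev_rms = rms;
        }
    }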

The CPU iterative running results are as follows:

num_users = 943; num_items = 1682; num_f = 30;

const float gamma = 0.01; const float lamda = 0.005;

the last few deltas are listed below:

delta_new: -9.74536e-05   stop condition: 0.216542
delta_new: 0.000672281    stop condition: -6.89847
delta_new: 5.57303e-06    (STOPPED here)

Total time to run classify on CPU: 10385.511719 (s). After training, the training RMS error is 0.344782, which means that on average the trained SVD model generates a rating matrix that differs by about 0.35 stars from the existing data.

Training took almost 3 hours on the Mako CPU.

UPDATED METHOD: While iterative training can yield an accurate result in the CPU demo, it is not suitable for GPU acceleration for several reasons:

  • GPUs are not well suited to shuffling the training data (at least I was unable to do this in any kernel).
  • In most real-world cases the amount of data is huge, so it makes sense for the GPU or CPU training process to make a single pass through the data and finish, since the model is likely to converge in that case.
  • I expanded the training data by simply copying the existing data several times and concatenating the copies at the end of the file, which helps both the CPU and GPU code converge (a sketch of this step follows the list).
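The expansion step itself is simple; a minimal sketch (assumed file handling, not the repository's code):

    #include <fstream>
    #include <sstream>

    // Read the original ratings file once and write it out `copies` times
    // back-to-back, producing the expanded training file.
    void expand_file(const char* in_path, const char* out_path, int copies) {
        std::ifstream in(in_path);
        std::stringstream buf;
        buf << in.rdbuf();
        std::ofstream out(out_path);
        for (int c = 0; c < copies; ++c)
            out << buf.str();
    }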

Results

Note:

  • Because the GPU training updates happen inside the kernel, I did not print the RMS error every time P and Q are updated, only at the end; the values are similar to the CPU RMS error.
  • Because the training is not run to completion, the CSV files containing the rating data are not accurate enough and only indicate a trend of the ratings. To obtain real rating recommendations, we need more data and more training passes.
  1. Original, non-expanded training file (100,000 ratings):
     Total time to run decomposition on CPU: 1.524624 (s)
     Total time to run decomposition on GPU: 0.496859 (s)
     CPU RMS: 3.69954, GPU RMS: 3.69957

  2. Expanded training file (560,000 ratings):
     Total time to run decomposition on CPU: 8.359917 (s)
     Total time to run decomposition on GPU: 0.896845 (s)
     CPU RMS: 1.28598, GPU RMS: 1.31158

As the training data grew large, the submission attachment can only contain this much data.

Performance Analysis

  1. The GPU has a significant time advantage over the CPU (3x - 10x).
  2. Because training is not run to completion, unlike the iterative model, the RMS error is relatively large.
  3. As the training data set grows, the accuracy should improve significantly.

Resources

  1. Netflix Prize Data Set: http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
  2. Matrix Factorization Techniques for Recommender Systems: https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.