dselivanov / rsparse

Fast and accurate machine learning on sparse matrices - matrix factorizations, regression, classification, top-N recommendations.

Home Page: https://www.slideshare.net/DmitriySelivanov/matrix-factorizations-for-recommender-systems

R 51.54% C++ 48.00% Shell 0.13% M4 0.26% Makefile 0.07%
collaborative-filtering recommender-system r matrix-factorization matrix-completion factorization-machines svd sparse-matrices

rsparse's Introduction

rsparse


rsparse is an R package for statistical learning primarily on sparse matrices - matrix factorizations, factorization machines, out-of-core regression. Many of the implemented algorithms are particularly useful for recommender systems and NLP.

We've paid attention to the implementation details - we try to avoid data copies, utilize multiple threads via OpenMP, and use SIMD where appropriate. The package can handle datasets with millions of rows and millions of columns.

Features

Classification/Regression

  1. Follow-the-Proximally-Regularized-Leader (FTRL), which allows solving very large linear/logistic regression problems with an elastic-net penalty. The solver uses stochastic gradient descent with adaptive learning rates, so it can be used for online learning - it is not necessary to load all data into RAM. See Ad Click Prediction: a View from the Trenches for more details. A usage sketch follows this list.
    • Only logistic regression is implemented at the moment
    • The native matrix format is CSR - Matrix::RsparseMatrix. However, the common R Matrix::CsparseMatrix (dgCMatrix) will be converted automatically.
  2. Factorization Machines - a supervised learning algorithm which learns second-order polynomial interactions in a factorized way. We provide a highly optimized, SIMD-accelerated implementation.
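
A minimal sketch of online logistic regression with the FTRL solver described in item 1 above. The constructor and method names used here (FTRL$new, partial_fit, predict, learning_rate, lambda, l1_ratio) are assumptions drawn from the description; check ?FTRL for the exact interface:

library(rsparse)
library(Matrix)

set.seed(1)
# toy data: a sparse design matrix and a binary target
x = rsparsematrix(1000, 100, density = 0.05)
x = as(x, "RsparseMatrix")   # CSR is the solver's native format
y = sample(c(0, 1), 1000, replace = TRUE)

# argument names below are assumptions - consult ?FTRL for the actual signature
ftrl = FTRL$new(learning_rate = 0.1, lambda = 1, l1_ratio = 0.5)
ftrl$partial_fit(x, y)       # data can be streamed in chunks, so it never has to fit in RAM at once
p = ftrl$predict(x)          # predicted probabilities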

Matrix Factorizations

  1. Vanilla Maximum Margin Matrix Factorization - a classic approach for "rating" prediction. See the WRMF class and constructor option feedback = "explicit". The original paper which introduced MMMF can be found here.
  2. Weighted Regularized Matrix Factorization (WRMF) from Collaborative Filtering for Implicit Feedback Datasets. See the WRMF class and constructor option feedback = "implicit" (a usage sketch follows this list). We provide 2 solvers:
    1. Exact based on Cholesky Factorization
    2. Approximate, based on a fixed number of Conjugate Gradient steps. See details in Applications of the Conjugate Gradient Method for Implicit Feedback Collaborative Filtering and Faster Implicit Matrix Factorization.
  3. Linear-Flow from Practical Linear Models for Large-Scale One-Class Collaborative Filtering. The algorithm looks for a factorized low-rank item-item similarity matrix (in some sense similar to SLIM).
  4. Fast Truncated SVD and Truncated Soft-SVD via Alternating Least Squares as described in Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares. Works for both sparse and dense matrices. Works on float matrices as well! For certain problems it may even be faster than the irlba package.
  5. Soft-Impute via fast Alternating Least Squares as described in Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.
    • with a solution in SVD form
  6. GloVe as described in GloVe: Global Vectors for Word Representation.
    • This is usually used to train word embeddings, but it is also very useful for recommender systems.
  7. Matrix scaling as described in EigenRec: Generalizing PureSVD for Effective and Efficient Top-N Recommendations
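
A minimal usage sketch of the WRMF class mentioned above. The constructor options and method names mirror those that appear in the issues further down, and the movielens100k dataset ships with the package:

library(rsparse)
data("movielens100k")   # user-item rating matrix included in the package

model = WRMF$new(rank = 10, lambda = 1, feedback = "implicit")
user_embeddings = model$fit_transform(movielens100k, n_iter = 10, convergence_tol = 0.01)

# top-10 recommendations for the first 5 users
preds = model$predict(movielens100k[1:5, ], k = 10)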

Note: the optimized matrix operations which rsparse used to offer have been moved to a separate package

Installation

Most of the algorithms benefit from OpenMP and many of them can utilize high-performance implementations of BLAS. If you want to get the most out of this package, please read the sections below carefully.

It is recommended to:

  1. Use high-performance BLAS (such as OpenBLAS, MKL, Apple Accelerate).
  2. Add proper compiler optimizations in your ~/.R/Makevars. For example, on recent processors (with AVX support) and a compiler with OpenMP support, the following lines could be a good option:
CXX11FLAGS += -O3 -march=native -fopenmp
CXXFLAGS   += -O3 -march=native -fopenmp
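
At runtime, the number of OpenMP threads used by rsparse can be capped with an option (the same option shows up in the issue reports below):

options("rsparse_omp_threads" = 4)  # e.g. limit rsparse to 4 threads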

Mac OS

If you are on Mac follow the instructions at https://mac.r-project.org/openmp/. After clang configuration, additionally put a PKG_CXXFLAGS += -DARMA_USE_OPENMP line in your ~/.R/Makevars. After that, install rsparse in the usual way.

We also recommend using vecLib - Apple's implementation of BLAS.

ln -sf  /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib /Library/Frameworks/R.framework/Resources/lib/libRblas.dylib

Linux

On Linux, it's enough to just create this file if it doesn't exist (~/.R/Makevars).

If using OpenBLAS, it is highly recommended to use the openmp variant rather than the pthreads variant. On Linux, it is usually available as a separate package in typical distribution package managers (e.g. for Debian, it can be obtained by installing libopenblas-openmp-dev, which is not the default version), and if there are multiple BLASes installed, can be set as the default through the Debian alternatives system - which can also be used for MKL.
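
For example, on Debian/Ubuntu the default BLAS can be switched interactively; the alternative name below is only an example and may differ between releases:

sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu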

Windows

By default, R for Windows comes with unoptimized BLAS and LAPACK libraries, and rsparse will prefer using Armadillo's replacements instead. In order to use BLAS, install rsparse from source (not from CRAN), removing the option -DARMA_DONT_USE_BLAS from src/Makevars.win and ideally adding -march=native (under PKG_CXXFLAGS). See this tutorial for instructions on getting R for Windows to use OpenBLAS. Alternatively, Microsoft's MRAN distribution for Windows comes with MKL.

Materials

Note that the syntax in these posts/slides is not up to date, since the package was under active development.

  1. Slides from DataFest Tbilisi (2017-11-16)
  2. Introduction to matrix factorization with Weighted-ALS algorithm - collaborative filtering for implicit feedback datasets.
  3. Music recommendations using LastFM-360K dataset
    • evaluation metrics for ranking
    • setting up proper cross-validation
    • possible issues with nested parallelism and thread contention
    • making recommendations for new users
    • complementary item-to-item recommendations
  4. Benchmark against other good implementations

Here is an example of rsparse::WRMF on the lastfm360k dataset in comparison with other good implementations:

API

We follow mlapi conventions.

Release and configure

Making release

Don't forget to add -DARMA_NO_DEBUG to PKG_CXXFLAGS to skip bounds checks (this has a significant impact on the NNLS solver):

PKG_CXXFLAGS = ... -DARMA_NO_DEBUG

Configure

Generate configure:

autoconf configure.ac > configure && chmod +x configure

rsparse's People

Contributors

aliciaschep, david-cortes, dselivanov, gsenseless, snoweye


rsparse's Issues

Instability in rsparse::WRMF convergence and loss function

Hey Rexy,

Your WRMF function looks really promising in terms of fast performance, but I've noticed some odd happenings.

First off, this is my use case:

RunNMF <- function(input, rank = NULL, max.iter = 100, rel.tol = 1e-4){
    require(rsparse)
    require(Matrix)
    input <- Matrix(input, sparse = TRUE)
    model <- WRMF$new(rank = rank, feedback = "explicit", non_negative = TRUE, solver = "cholesky")
    result <- model$fit_transform(input, n_iter = max.iter, convergence_tol = rel.tol)
    return(result)
}

1. WRMF converges when error loss increases, not when it begins to decrease more slowly
I noticed that your function has some instability that permits the error loss to increase from one iteration to the next. For example, here's output showing how this is the case:

> n <- RunNMF(A, rank = 10)
INFO  [08:59:26.158] starting factorization with 6 threads 
INFO  [08:59:26.229] iter 1 loss = 51.5322  
INFO  [08:59:26.302] iter 2 loss = 4.6352  
INFO  [08:59:26.373] iter 3 loss = 2.5364  
INFO  [08:59:26.441] iter 4 loss = 2.6215  
INFO  [08:59:26.442] Converged after 4 iterations 

It "Converged" at the fourth iteration, where the error loss went up from 2.5364 to 2.6215. Also, I'm not sure how you're measuring error loss. Typically it's calculated as 1- (loss.prev.iter - loss.current.iter)/loss.prev.iter.

This observation implies there is some inherent instability in the algorithm.

2. Instability in WRMF is caused by consideration of only positive values
Here's your solver function, rewritten to highlight the specific use case above:

sparse.nnls <- function(A, X) {
    res <- list()
    for (i in 1:ncol(A)) {
        p1 <- A@p[[i]] # pointer to the start of column i in A@x
        p2 <- A@p[[i + 1]] # pointer to the start of column i + 1 in A@x
        j <- p1 + 1:(p2 - p1) # indices into A@x of the non-zero entries of A[, i]
        A_pos <- A@x[j] # all positive (non-zero) values in A[, i]
        ind_pos <- A@i[j] + 1 # row indices of the non-zero entries of A[, i]
        X_pos <- X[, ind_pos] # corresponding values in X
        res[[i]] <- nnls(tcrossprod(X_pos), tcrossprod(X_pos,t(A_pos)))$x # Fortran77 implementation of fnnls
    }
    do.call(cbind,res)
}

Note that the indices of positive values in A are pulled from every column, and X is then subset by those indices. NNLS is then only run on the positive indices. This is the problem. Running NNLS on only positive indices challenges the ability of NMF to infer missing values. For data where there are no missing values, this approach makes perfect sense and is awesomely fast. However, if your algorithm really is not meant to handle missing values, then that should be in big bold letters at the top of the package documentation so the theoretical implications are understood by all users immediately.

Indeed, if I rework the solver function above to a dense representation of the data:

dense.nnls <- function(A, X) {
    res <- list()
    for (i in 1:ncol(A)) {
        res[[i]] <- nnls(tcrossprod(X), tcrossprod(X,t(A[,i])))$x # Fortran77 implementation of fnnls
    }
    do.call(cbind,res)
}

I no longer get instability in the error loss function.

Sorry I haven't taken the time to produce reproducible examples. I just wanted to raise this as an issue and see if you have any thoughts on the theoretical underpinnings. Maybe I'm wrong, but I'm definitely seeing something I don't see in other NMF algorithms. But... WRMF is so fast. It would be nice to make it work.

How to install the reco R package?

Hello

I am unable to install this package. Also, I cannot find any documentation about how to install.
I am getting the following error:

install.packages("reco")
Installing package into ‘/home/manish/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘reco’ is not available (for R version 3.4.1)

Please advise.

Best,
Manish

WRMF user and item biases for implicit feedback data

@dselivanov I’m not so sure it’s something desirable to have actually. I tried playing with centering and biases with implicit-feedback data, and I see that adding user biases usually gives a very small lift in metrics like HR@5, but item biases make them much worse.

You can play with cmfrec (version from git, the one from CRAN has bugs for this use-case) like this with e.g. the lastFM data or similar, which would fit the same model as WRMF with feedback="implicit":

library(cmfrec)
Xvalues <- Xcoo@x
Xcoo@x <- rep(1, length(Xcoo@x))
model <- CMF(Xcoo, weight=Xvalues, NA_as_zero=TRUE,
             center=TRUE, user_bias=TRUE, item_bias=TRUE)

Originally posted by @david-cortes in #44 (comment)

Classification Using Factorization Machines

Thank you for this great package. Somehow, FM is not returning the probabilities for a classification problem and I can't find any parameter to force that as well. Please have a look at my code below and let me know if I am missing something

fm <- FactorizationMachine$new(learning_rate_w = learning_rate, rank = rank, lambda_w = lambda1, lambda_v = lambda2, task = "classification", intercept = TRUE, learning_rate_v = learning_rate)
fm$fit(trainSparse, target, n_iter = iter)
pred <- fm$predict(testSparse)

The code runs without any issues but returns the predicted class rather than probabilities.

Optimization objective under explicit feedback

In WRMF, there is the following loss function:

// [[Rcpp::export]]
double als_loss_explicit(const Rcpp::S4 &m_csc_r, arma::mat& X, arma::mat& Y, double lambda, unsigned n_threads) {
  dMappedCSC mat = extract_mapped_csc(m_csc_r);
  size_t nc = mat.n_cols;
  double loss = 0;
  #ifdef _OPENMP
  #pragma omp parallel for num_threads(n_threads) schedule(dynamic, GRAIN_SIZE) reduction(+:loss)
  #endif
  for(size_t i = 0; i < nc; i++) {
    int p1 = mat.col_ptrs[i];
    int p2 = mat.col_ptrs[i + 1];
    for(int pp = p1; pp < p2; pp++) {
      size_t ind = mat.row_indices[pp];
      double diff = mat.values[pp] - as_scalar(Y.col(i).t() * X.col(ind));
      loss += diff * diff;
    }
  }
  if(lambda > 0)
    loss += lambda * (accu(square(X)) + accu(square(Y)));
  return loss / mat.nnz;
}

The function calculates squared loss over non-missing entries.

But then there is the following solver:

solver_explicit_feedback = function(R, X) {
      # FIXME - consider to use private XtX
      XtX = tcrossprod(X) + diag(x = private$lambda, nrow = private$rank, ncol = private$rank)
      solve(XtX, as(X %*% R, "matrix"))
    }

The solver finds the solution to a problem in which the missing entries are zeros (i.e. they count towards the loss).

Should one of them be fixed?

When evaluated in terms of RMSE, the solver with missing-as-zero tends to lead to very poor results (e.g. fixing k=40 and testing on the ML10M, I get an RMSE of ~0.95-0.97, compared to an RMSE of 0.78-0.82 with software that ignores missing entries).

Embarrassingly Shallow Autoencoders for Sparse Data

https://arxiv.org/abs/1905.03375 - should be trivial to implement

Combining simple elements from the literature, we define a linear model that is geared toward sparse data, in particular implicit feedback data for recommender systems. We show that its training objective has a closed-form solution, and discuss the resulting conceptual insights. Surprisingly, this simple model achieves better ranking accuracy than various state-of-the-art collaborative-filtering approaches, including deep non-linear models, on most of the publicly available data-sets used in our experiments.
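
For reference, a rough sketch of the closed-form solution from the paper (not part of rsparse); X is a sparse user-item interaction matrix and lambda the L2 penalty:

# EASE closed form: B = I - P * diagMat(1 / diag(P)), where P = (X'X + lambda * I)^-1,
# with the diagonal of B constrained to zero
ease_fit = function(X, lambda = 100) {
  G = as.matrix(Matrix::crossprod(X))   # item-item Gram matrix X'X
  diag(G) = diag(G) + lambda            # ridge penalty on the diagonal
  P = solve(G)
  B = -sweep(P, 2, diag(P), "/")        # B[i, j] = -P[i, j] / P[j, j]
  diag(B) = 0                           # zero self-similarity constraint
  B
}
# predicted scores for all users: X %*% ease_fit(X, lambda = 100)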

Configure script doesn't pick up OpenMP

In a clean Debian Linux install, the configure script does not end up adding OpenMP linkage flags to Makevars. I guess something is going wrong in the testing of R's flags - I get this when I run the command on my setup:

R CMD config --ldflags
-Wl,--export-dynamic -fopenmp -Wl,-z,relro -L/usr/lib/R/lib -lR -lpcre2-8 -llzma -lbz2 -lz -ltirpc -lrt -ldl -lm -licuuc -licui18n

But the configure script still doesn't end up picking up -fopenmp when it generates the final Makevars file.

Nowadays it's BTW safe to just use $(SHLIB_OPENMP_CXXFLAGS) directly in the Makevars.in file without having to test for it in the configure script.
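
That is, roughly the following in Makevars.in (the real file will carry additional flags; this is the standard pattern from Writing R Extensions):

PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)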

WRMF vignette

  • explain confidence
  • solvers - cholesky, conjugate gradient
  • checking convergence
  • excluding items from prediction

Q: Python wrappers?

Sorry for asking, but could this library include a Python wrapper for trying the new algorithms?

user and item biases in WRMF and explicit feedback

As of 35247b4

without biases

library(rsparse)
library(lgr)
lg = get_logger('rsparse')
lg$set_threshold('debug')
data('movielens100k')
options("rsparse_omp_threads" = 1)

train = movielens100k

set.seed(1)
model = WRMF$new(rank = 10,  lambda = 1, feedback  = 'explicit', solver = 'cholesky', with_bias = FALSE)
user_emb = model$fit_transform(train, n_iter = 10, convergence_tol = -1)

INFO [23:09:40.158] starting factorization with 1 threads
INFO [23:09:40.268] iter 1 loss = 4.4257
INFO [23:09:40.302] iter 2 loss = 1.2200
INFO [23:09:40.332] iter 3 loss = 0.8617
INFO [23:09:40.361] iter 4 loss = 0.7752
INFO [23:09:40.391] iter 5 loss = 0.7398
INFO [23:09:40.420] iter 6 loss = 0.7191
INFO [23:09:40.456] iter 7 loss = 0.7046
INFO [23:09:40.488] iter 8 loss = 0.6935
INFO [23:09:40.522] iter 9 loss = 0.6845
INFO [23:09:40.555] iter 10 loss = 0.6769

with biases

set.seed(1)
model = WRMF$new(rank = 10,  lambda = 1, feedback  = 'explicit', solver = 'cholesky', with_bias = TRUE)
user_emb = model$fit_transform(train, n_iter = 10, convergence_tol = -1)

INFO [23:10:06.605] starting factorization with 1 threads
INFO [23:10:06.637] iter 1 loss = 0.8411
INFO [23:10:06.671] iter 2 loss = 0.6251
INFO [23:10:06.704] iter 3 loss = 0.5950
INFO [23:10:06.736] iter 4 loss = 0.5820
INFO [23:10:06.769] iter 5 loss = 0.5751
INFO [23:10:06.805] iter 6 loss = 0.5712
INFO [23:10:06.840] iter 7 loss = 0.5688
INFO [23:10:06.875] iter 8 loss = 0.5673
INFO [23:10:06.916] iter 9 loss = 0.5663
INFO [23:10:06.951] iter 10 loss = 0.5657

cc @david-cortes

future float R version dependency

Just a heads up, but in v3.6.0 of R, they are changing some Make macros that affect float. I believe the expectation is that to comply, I will need to submit an update soon using the new macros, and an R (>= 3.6.0) dependency.

I try to support pretty old versions of R so I'm not thrilled about this. Unfortunately I don't think there's any way to avoid it. Hopefully this doesn't impact you too much.

Integer comparison compilation warning

Issue: compilation warning

Hi there and thanks for the interesting package. I have been using it quite a bit and wouldn't mind helping out if I can. I am getting this compilation warning:

utils.cpp:43:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(q.size() < k){
^

githubinstall::gh_install_packages("reco", ask = FALSE, force=TRUE)

Show in New WindowClear OutputExpand/Collapse Output
Downloading GitHub repo dselivanov/reco@master
from URL https://api.github.com/repos/dselivanov/reco/zipball/master
Installing reco
"C:/PROGRA1/MICROS1/ROPEN1/R-341.0/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL
"C:/Users/****/AppData/Local/Temp/RtmpyKsknU/devtools22d8bd227b1/dselivanov-reco-4058c0d" --library="C:/R_Packages"
--install-tests

  • installing source package 'reco' ...
    ** libs
    c:/Rtools/mingw_64/bin/g++ -m64 -std=gnu++11 -I"C:/PROGRA1/MICROS1/ROPEN1/R-341.0/include" -DNDEBUG -I"C:/R_Packages/Rcpp/include" -I"C:/R_Packages/RcppArmadillo/include" -I"C:/swarm/workspace/External-R-3.3.3/vendor/extsoft/include" -fopenmp -DARMA_64BIT_WORD -DARMA_DONT_USE_OPENMP -O2 -Wall -mtune=native -c RcppExports.cpp -o RcppExports.o
    c:/Rtools/mingw_64/bin/g++ -m64 -std=gnu++11 -I"C:/PROGRA1/MICROS1/ROPEN1/R-341.0/include" -DNDEBUG -I"C:/R_Packages/Rcpp/include" -I"C:/R_Packages/RcppArmadillo/include" -I"C:/swarm/workspace/External-R-3.3.3/vendor/extsoft/include" -fopenmp -DARMA_64BIT_WORD -DARMA_DONT_USE_OPENMP -O2 -Wall -mtune=native -c als_implicit_core_solver.cpp -o als_implicit_core_solver.o
    c:/Rtools/mingw_64/bin/g++ -m64 -std=gnu++11 -I"C:/PROGRA1/MICROS1/ROPEN1/R-341.0/include" -DNDEBUG -I"C:/R_Packages/Rcpp/include" -I"C:/R_Packages/RcppArmadillo/include" -I"C:/swarm/workspace/External-R-3.3.3/vendor/extsoft/include" -fopenmp -DARMA_64BIT_WORD -DARMA_DONT_USE_OPENMP -O2 -Wall -mtune=native -c utils.cpp -o utils.o
    utils.cpp: In function 'Rcpp::IntegerMatrix dotprod_top_k(const mat&, const mat&, int, int, Rcpp::Nullable<const arma::SpMat >&)':
    utils.cpp:43:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
    if(q.size() < k){
    ^
    c:/Rtools/mingw_64/bin/g++ -m64 -shared -s -static-libgcc -o reco.dll tmp.def RcppExports.o als_implicit_core_solver.o utils.o -fopenmp -fopenmp -LC:/PROGRA1/MICROS1/ROPEN1/R-341.0/bin/x64 -lRlapack -LC:/PROGRA1/MICROS1/ROPEN1/R-341.0/bin/x64 -lRblas -lgfortran -lm -lquadmath -LC:/swarm/workspace/External-R-3.3.3/vendor/extsoft/lib/x64 -LC:/swarm/workspace/External-R-3.3.3/vendor/extsoft/lib -LC:/PROGRA1/MICROS1/ROPEN1/R-341.0/bin/x64 -lR
    installing to C:/R_Packages/reco/libs/x64
    ** R
    ** byte-compile and prepare package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** testing if installed package can be loaded
  • DONE (reco)

SessionInfo

sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] RevoUtilsMath_10.0.0

loaded via a namespace (and not attached):
[1] httr_1.2.1 compiler_3.4.0 R6_2.2.0 RevoUtils_10.0.4 tools_3.4.0 withr_2.0.0
[7] yaml_2.1.14 curl_2.6 githubinstall_0.2.1 memoise_1.1.0 knitr_1.15.1 data.table_1.10.4
[13] git2r_0.18.0 jsonlite_1.4 digest_0.6.12 devtools_1.12.0

Is there any testing that I could help out with to run this warning down or is this a 'quick' fix by recasting the variable type to match? Also, I am happy to offer some typo/spelling fixes in documentation and code comments if you would be willing to accept them. Cheers!

Huge performance degradation for WRMF

In version 0.5.0 from CRAN (installed with a modified Makevars.in to force OMP linkage), there is a huge slowdown in WRMF with implicit feedback compared to earlier versions.

For example, if I try running it on the LastFM-360K dataset with this configuration + 15 iterations with no early stopping:

WRMF$new(feedback="implicit", rank=50, lambda=5,
         solver="conjugate_gradient",
         with_global_bias=FALSE, with_user_item_bias=FALSE)

And then compare different libraries with these same settings, I get the following times:

  • rsparse: 39.18s
  • cmfrec: 29.52s
  • implicit: 29.0s

Whereas in earlier versions the time was somewhere between implicit and cmfrec. The Cholesky solver is also affected by this slowdown.

I haven't been able to pinpoint what is causing the slowdown. Tried adding extra armadillo defines like DARMA_DONT_USE_WRAPPER, DARMA_USE_BLAS, DARMA_USE_LAPACK, DARMA_USE_OPENMP, but it didn't make a difference.

Expanded GLM offering

From FTRL, I saw that Poisson regression is on your to-do list. As an alternative, I'd like to suggest that you consider an implementation of the Tweedie family which allows, through tuning parameters, the Gaussian, Poisson, Gamma and Inverse Gaussian families, as well as the gradients between.

input a list of items in not_recommend?

First of all, thanks very much for this package!

In my use case, I have to check the stock and see which products are available for recommendations. So usually I will have a list of items that I can't recommend for all users - This list may be very large.

I'm creating a sparse matrix with a lot of columns full of ones, and this is very slow. Would it be possible to pass a vector of items for the not_recommend argument?

Thanks,
Daniel
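
(For context, here is a sketch of the kind of exclusion mask the current interface expects - a hypothetical helper, not part of rsparse. It needs n_users * length(excluded_items) non-zero entries, which is exactly why passing a plain item vector would be much cheaper.)

library(Matrix)
# hypothetical helper: a users x items sparse mask with ones in the excluded columns
build_exclusion_mask = function(n_users, n_items, excluded_items) {
  sparseMatrix(
    i = rep(seq_len(n_users), times = length(excluded_items)),
    j = rep(excluded_items, each = n_users),
    x = 1,
    dims = c(n_users, n_items)
  )
}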

Simple shiny demo based on lastfm-360k

Recommendation types

  • similar artists
  • recommendations for a given list of artists (considering it as history of listenings)

Methods

  • PureSVD
  • WRMF (implicit feedback)
  • MMMF (explicit feedback)
  • LinearFlow with svd initialization
  • LinearFlow with soft-impute initialization

How to use item_exclude

Hi!

Thanks for a really impressive package.

In my world there are two common scenarios when building recommendation systems. You either want to recommend products that a customer has never liked (or bought) from your whole catalogue or you want to recommend products from a subset of the catalogue, e.g. products that are discounted. Most implementations of collaborative filtering focus on the first scenario. My question is how to use the item_exclude to tackle the second scenario. This is somewhat related to a previous issue

For instance, say that we have 60 artists whose album are on sale in the lastfm dataset that we want to recommend.

Example code from: http://dsnotes.com/post/2017-06-28-matrix-factorization-for-recommender-systems-part-2/

set.seed(1)
library(data.table)
raw_data = fread("lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv",
                 showProgress = FALSE, encoding = "UTF-8",
                 quote = "")
setnames(raw_data, c("user_id", "artist_id", "artist_name", "number_plays"))

user_encoding <- raw_data[, .(uid = .GRP), keyby = user_id]

item_encoding = raw_data[, .(iid = .GRP, artist_name = artist_name[[1]]), keyby = artist_id]

Here I'll sample 60 artists "on sale" and create a table of items to exclude from the predictions.

on_sale <- sample(item_encoding$artist_name, 60)
items_exclude <- item_encoding[!(artist_name %in% on_sale)]
on_sale
 [1] "the bridge"                          "snippet"                            
 [3] "v.o.s."                              "the ullulators"                     
 [5] "藤井フミヤ"                          "erika jo"                           
 [7] "gore"                                "amaral"                             
 [9] "ceili rain"                          "schwarze puppen"                    
[11] "dan wheeler"                         "yuki suzuki"                        
[13] "krymplings"                          "olivia ruiz"                        
[15] "edgewater"                           "karl johan"                         
[17] "pamela z"                            "global spirit"                      
[19] "damien youth"                        "fires of babylon"                   
[21] "comic relief"                        "emmanuel horvilleur"                
[23] "sandra stephens"                     "cyclopede"                          
[25] "Михаил Боярский"                     "the great eastern"                  
[27] "radwimps"                            "papa austin with the great peso"    
[29] "phasen"                              "mari menari"                        
[31] "Холодне Сонце"                       "laura story"                        
[33] "mugwart"                             "errand boy"                         
[35] "erlend krauser"                      "göran fristorp"                     
[37] "mousse t & emma lanford"             "dj vlad & dirty harry"              
[39] "denim"                               "thomas leer & robert rental"        
[41] "the underdog project vs the sunclub" "sense club"                         
[43] "mary kiani"                          "ladies night"                       
[45] "tresk"                               "the peddlers"                       
[47] "quatuor ysaÿe"                       "brandhärd"                          
[49] "bittor aiape"                        "prince francis"                     
[51] "alex klaasen & martine sandifort"    "peppermint petty"                   
[53] "dave ramsey"                         "müşfik kenter"                      
[55] "shima & shikou duo"                  "jimmy j & cru-l-t"                  
[57] "ankarali yasemin"                    "marian opania"                      
[59] "madita"                              "zoltar"   

Below is some data manipulation to put the data into a sparse matrix.

library(Matrix)
raw_data[, artist_name := NULL]
dt = user_encoding[raw_data, .(artist_id, uid, number_plays), on = .(user_id = user_id)]
dt = item_encoding[dt, .(iid, uid, number_plays), on = .(artist_id = artist_id)]
rm(raw_data)

X = sparseMatrix(i = dt$uid, j = dt$iid, x = dt$number_plays, 
                 dimnames = list(user_encoding$user_id, item_encoding$artist_name))
N_CV = 1000L
cv_uid = sample(nrow(user_encoding), N_CV)

X_train = X[-cv_uid, ]
X_cv = X[cv_uid, ]
rm(X)

Here we fit the model.

make_confidence = function(x, alpha) {
  x_confidence = x
  stopifnot(inherits(x, "sparseMatrix"))
  x_confidence@x = 1 + alpha * x@x
  x_confidence
}
library(rsparse)
model = WRMF$new(x_train = x_train, x_cv = X_cv, rank = 8, feedback = "implicit")
set.seed(1)
alpha = 0.01
X_train_conf = make_confidence(X_train, alpha)
X_cv_history_conf = make_confidence(X_cv_history, alpha)
user_embeddings = model$fit_transform(X_train_conf, n_iter = 10L, n_threads = 8)
new_user_embeddings = model$transform(X_cv_history_conf)

Now, I want to recommend only the artists that are on sale, so I pass the excluded artists to the items_exclude argument.

new_user_1 = X_cv[1:1, , drop = FALSE]
new_user_predictions = model$predict(new_user_1, k = 60, items_exclude = items_exclude$artist_name)

head(data.frame(segmentid = t(attr(new_user_predictions, "ids"))))
  e9dc15dfabe0bdac615143623e1fe83ba4e2daa5
1                                   björk
2                  einstürzende neubauten
3                                     isis
4                        frédéric chopin
5                               sigur rós
6                        ë\u008f™ë°©ì‹ 기

However, these recommendations are not the ones on sale?

I suppose this would be clearer with a vignette, which I can see is on its way; in the meanwhile, how should one use the item_exclude argument?

Furthermore, say we want to maximize the recommendations here, i.e. put k = 60, would that work for multiple users?

Development version failing compilation with devtools::install_github("rexyai/rsparse")

I've tried installing rsparse with devtools:

library(devtools)
install_github("rexyai/rsparse")

This returns the following error (tail of the system log result):

"C:/Rtools/mingw32/bin/"g++  -std=gnu++11 -I"C:/PROGRA~1/R/R-40~1.2/include" -DNDEBUG  -I'C:/Users/zacha/Documents/R/win-library/4.0/Rcpp/include' -I'C:/Users/zacha/Documents/R/win-library/4.0/RcppArmadillo/include'     -I../inst/include/ -fopenmp -DARMA_32BIT_WORD -DARMA_DONT_USE_BLAS -DARMA_NO_DEBUG   -O2 -Wall  -mfpmath=sse -msse2 -mstackrealign -c RcppExports.cpp -o RcppExports.o
RcppExports.cpp:396:73: error: 'uint' has not been declared
 arma::Mat<double> c_nnls_double(const arma::mat& x, const arma::mat& y, uint max_iter, double rel_tol);
                                                                         ^~~~
RcppExports.cpp: In function 'SEXPREC* _rsparse_c_nnls_double(SEXP, SEXP, SEXP, SEXP)':
RcppExports.cpp:402:36: error: 'uint' was not declared in this scope
     Rcpp::traits::input_parameter< uint >::type max_iter(max_iterSEXP);
                                    ^~~~
RcppExports.cpp:402:36: note: suggested alternative: 'Sint'
     Rcpp::traits::input_parameter< uint >::type max_iter(max_iterSEXP);
                                    ^~~~
                                    Sint
RcppExports.cpp:402:41: error: template argument 1 is invalid
     Rcpp::traits::input_parameter< uint >::type max_iter(max_iterSEXP);
                                         ^
RcppExports.cpp:402:49: error: expected initializer before 'max_iter'
     Rcpp::traits::input_parameter< uint >::type max_iter(max_iterSEXP);
                                                 ^~~~~~~~
RcppExports.cpp:404:54: error: 'max_iter' was not declared in this scope
     rcpp_result_gen = Rcpp::wrap(c_nnls_double(x, y, max_iter, rel_tol));
                                                      ^~~~~~~~
RcppExports.cpp:404:54: note: suggested alternative: 'max_iterSEXP'
     rcpp_result_gen = Rcpp::wrap(c_nnls_double(x, y, max_iter, rel_tol));
                                                      ^~~~~~~~
                                                      max_iterSEXP
make: *** [C:/PROGRA~1/R/R-40~1.2/etc/i386/Makeconf:229: RcppExports.o] Error 1
ERROR: compilation failed for package 'rsparse'
* removing 'C:/Users/zacha/Documents/R/win-library/4.0/rsparse'
Error: Failed to install 'rsparse' from GitHub:
  (converted from warning) installation of package ‘C:/Users/zacha/AppData/Local/Temp/RtmpsH0MQU/file4b1017134c0/rsparse_0.5.0.tar.gz’ had non-zero exit status

My sessionInfo():

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] devtools_2.3.1 usethis_1.6.1 

loaded via a namespace (and not attached):
 [1] rstudioapi_0.11   magrittr_1.5      pkgload_1.1.0     lattice_0.20-41   R6_2.5.0         
 [6] rlang_0.4.7       fansi_0.4.1       tools_4.0.2       pkgbuild_1.1.0    grid_4.0.2       
[11] sessioninfo_1.1.1 cli_2.0.2         withr_2.2.0       remotes_2.2.0     ellipsis_0.3.1   
[16] yaml_2.2.1        assertthat_0.2.1  digest_0.6.25     rprojroot_1.3-2   crayon_1.3.4     
[21] processx_3.4.3    Matrix_1.2-18     callr_3.4.3       fs_1.4.2          ps_1.3.3         
[26] curl_4.3          testthat_2.3.2    memoise_1.1.0     glue_1.4.2        compiler_4.0.2   
[31] desc_1.2.0        backports_1.1.8   prettyunits_1.1.1

If I understand the issue correctly, the uint type used in RcppExports is causing the problem. I can't get the RcppExports file to compile on its own by loading it directly via source(). However, if I change uint to int it works fine.

Thanks again for your work.

Error loading rsparse after install

Hi all,

I've been receiving this error when trying to load rsparse

  1. Have tried installing from CRAN and it appears to install fine, but when loading with library(rsparse)....

> library(rsparse) Error: package or namespace load failed for ‘rsparse’ in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/rsparse/libs/rsparse.so': dlopen(/Library/Frameworks/R.framework/Versions/3.6/Resources/library/rsparse/libs/rsparse.so, 6): Library not loaded: @rpath/Volumes/SSD-Data/Builds/R-dev-web/QA/Simon/packages/el-capitan-x86_64/Rlib/3.6/float/libs/float.so Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rsparse/libs/rsparse.so Reason: image not found

I have confirmed that the float package has been installed (both doing so through CRAN, and have tried through devtools).

When trying to install rsparse through github, I get the following error:

`* installing source package ‘rsparse’ ...
** using staged installation
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... configure: error: in '/private/var/folders/wg/zm1l13hs4_j80f2yndrd1z_w0000gp/T/Rtmpu6XVG0/R.INSTALLac747f212ad4/rsparse':
configure: error: cannot run C++ compiled programs.
If you meant to cross compile, use '--host'.
See 'config.log' for more details
ERROR: configuration failed for package ‘rsparse’

  • removing ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/rsparse’
    Error in i.p(...) :
    (converted from warning) installation of package ‘/var/folders/wg/zm1l13hs4_j80f2yndrd1z_w0000gp/T//RtmpVN3D91/fileac29105bb95d/rsparse_0.3.3.2.tar.gz’ had non-zero exit status`

Usually, errors like the above come down to having to reinstall the command line tools, but I have done so and I still get the same error.


sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 rstudioapi_0.10 magrittr_1.5 usethis_1.5.0 devtools_2.0.2 pkgload_1.0.2 R6_2.4.0 rlang_0.3.4.9003
[9] tools_3.6.0 pkgbuild_1.0.3 sessioninfo_1.1.1 cli_1.1.0 withr_2.1.2 remotes_2.0.4 assertthat_0.2.1 digest_0.6.19
[17] rprojroot_1.3-2 crayon_1.3.4 processx_3.3.1 callr_3.2.0 fs_1.3.1 ps_1.3.0 curl_3.3 testthat_2.1.1
[25] memoise_1.1.0 glue_1.3.1 compiler_3.6.0 desc_1.2.0 backports_1.1.4 prettyunits_1.0.2

Install package

Hi, I wanted to install the package using devtools::install_github("dselivanov/rsparse") but the installation always fails. The error it shows is:

ERROR: dependency ‘float’ is not available for package ‘rsparse’

How can I fix this?

item_exclude

Hello Dmitriy ,

First of all, thank you for your work. I've been using rsparse for a couple of weeks and everything is running smoothly. The addition of the evaluation metrics is really a plus, and I've been toying around with data to see which algorithm works best with the data I have (currently Linear Flow).

There is something, though, that I can't make work properly. The argument "items_exclude" of the predict function is not working as it should when item names (strings) are provided. I provide the names (strings) of the items to exclude, but they end up in the prediction. I've looked at the code; this seems to come from these lines:

items_exclude = match(items_exclude, private$item_ids)
item_embeddings = item_embeddings[, -items_exclude, drop = FALSE]

I should have an output of 46 columns, but it gives me 122. I instead use these lines, which give me the correct number of items:

items_exclude = Reduce(intersect, list(items_exclude,private$item_ids))
item_embeddings = item_embeddings[, !(colnames(item_embeddings) %in% items_exclude), drop=FALSE]

When I provide the item IDs, it gives me the right number of items to recommend; unfortunately they are not the right products, so I'll look into that now.

single precision solvers

Allow using single precision - this makes it possible to use 2x more data and fit 2x faster. See the float package.

  • WRMF for implicit feedback
  • WRMF for explicit feedback
  • soft-svd

Cholesky solver

I'm getting some pretty bad results using WRMF with the Cholesky solver.

I tried doing a reproducible experiment as follows:

  • Took the LastFM360k data.
  • Set the first 357k users as train, rest ~1800 as test.
  • Fit a WRMF model with k=40 and lambda=1 and 10 iterations.
  • Calculated factors and top-5 predictions for the test users.
  • Calculated hit rate at 5 for them.

And got the following results:

  • CG: 0.4441335
  • Cholesky: 0.4179763

Which is certainly not what I'd expect as the Cholesky is a more exact method and should lead to better results. Using different random seeds did not result in any significant change.

I additionally see a strange behavior with the timings:

  • CG: ~21s
  • Chol: ~32s

According to the reference paper Applications of the conjugate gradient method for implicit feedback collaborative filtering, the Cholesky solver in this case should take close to 3x longer than the CG one. I did the same experiment with this package and got the expected results: Cholesky takes 3x longer and gets a better end result:

  • CG: ~23s | 0.4501615
  • Chol: ~64s | 0.4512379

Tried playing with the armadillo option arma::solve_opts::fast, changing it to something more exact, but it didn't make any difference in the HR@5 metric.

I'm not familiar with armadillo but I suspect this line is incorrect:

arma::Mat<T> inv = XtX + X_nnz.each_row() % (confidence.t() - 1) * X_nnz.t();

Since it's doing an operation per row, whereas it should be doing rank-1 updates.
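
For reference, the per-user system both solvers are meant to solve (from the implicit-feedback ALS paper referenced above) is:

$$(Y^\top C^u Y + \lambda I)\,x_u = Y^\top C^u p(u), \qquad Y^\top C^u Y = Y^\top Y + Y^\top (C^u - I)\,Y,$$

where $C^u - I$ is non-zero only for the items the user has interacted with, so the second term can be accumulated with one rank-1 update per non-zero entry.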

Non-negativity constraints

In WRMF, there is an option non_negative, which is handled by first finding the unconstrained factors and then setting them to zero if they turn out negative:

# if need non-negative matrix factorization - just set all negative values to zero
if (private$non_negative) private$U[private$U < 0] = 0

But in general, the solution of a constrained linear system is not the solution of the unconstrained system with the non-conforming coefficients set to the bounds (see e.g. https://en.wikipedia.org/wiki/Non-negative_least_squares).

Now, since it iterates multiple times, the fitted factors are probably not going to be too bad (save for the last iteration), but then it again does the same thing when calling transform, which is not optimal.

Did this technique come from some reference?
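
For illustration, a tiny sketch (using the standalone nnls package, not rsparse internals) showing that clipping the unconstrained solution is generally not the same as solving the constrained problem:

library(nnls)
set.seed(1)
A = matrix(rnorm(50 * 3), 50, 3)
b = as.numeric(A %*% c(1, -0.5, 2) + rnorm(50, sd = 0.1))

beta_ols     = solve(crossprod(A), crossprod(A, b))  # unconstrained least squares
beta_clipped = pmax(beta_ols, 0)                     # negatives set to zero
beta_nnls    = nnls(A, b)$x                          # true non-negative LS solution

# the clipped solution is feasible but its residual sum of squares is never lower
sum((b - A %*% beta_clipped)^2)
sum((b - A %*% beta_nnls)^2)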

devtools::install_github("dselivanov/rsparse") Win7 Will not compile.

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default


I could download and build 'float'.

devtools::install_github("dselivanov/float")
it works.

devtools::install_github("dselivanov/rsparse")

  • installing source package 'rsparse' ...
    WARNING: this package has a configure script
    It probably needs manual configuration

** libs
*** arch - i386
C:/RBuildTools/3.4/mingw_32/bin/g++ -I"C:/PROGRA1/R/R-341.4/include" -DNDEBUG -I"C:/Users/reefej/R/win-library/3.4/Rcpp/include" -I"C:/Users/reefej/R/win-library/3.4/RcppArmadillo/include" -O2 -Wall -mtune=generic -c FTRL.cpp -o FTRL.o
In file included from C:/RBuildTools/3.4/mingw_32/i686-w64-mingw32/include/c++/random:35:0,
from FTRL.cpp:1:

C:/RBuildTools/3.4/mingw_32/i686-w64-mingw32/include/c++/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
#error This file requires compiler and library support for the
^
In file included from rsparse.h:1:0,
from FTRL.cpp:2:

I am not sure how to change the settings in the downloaded package. Afterwards, rsparse is not there.

HOWEVER
Observing from another issue,

install.packages('githubinstall')
library('githubinstall')
githubinstall::gh_install_packages("rsparse", ask = FALSE, force=TRUE)

This works fine and rsparse is there.

Oops. No.

I got this:

githubinstall::gh_install_packages("rsparse", ask = FALSE, force=TRUE)

Downloading GitHub repo dgrtwo/rparse@master
from URL https://api.github.com/repos/dgrtwo/rparse/zipball/master
Installing rparse
"C:/PROGRA1/R/R-341.4/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL
"C:/Users/reefej/AppData/Local/Temp/RtmpykTwMI/devtools19686420f4c/dgrtwo-rparse-bdef159" --library="C:/Users/reefej/R/win-library/3.4"
--install-tests

It then downloads dgrtwo/rparse instead - a different package - and I cannot load the library.

How to get ALS function ?

Hello Dmitriy

How do I use the ALS function? I cannot find it in the package. Is there another package I should load?
The command below doesn't work.

model = ALS$new(rank = 8)

I am trying to reproduce the results from your blog (matrix factorisation part 2) and faced this problem.

CRAN release

This package looks really impressive. Are you planning to put it on CRAN?
