david-cortes / hpfrec

Python implementation of 'Scalable Recommendation with Hierarchical Poisson Factorization'.

Home Page: http://hpfrec.readthedocs.io

License: BSD 2-Clause "Simplified" License

Python 66.60% Cython 33.40%

poisson-factorization implicit-feedback

hpfrec's Introduction

Hierarchical Poisson Factorization

This is a Python package for hierarchical Poisson factorization, a form of probabilistic matrix factorization used for recommender systems with implicit count data, based on the paper Scalable Recommendation with Hierarchical Poisson Factorization (P. Gopalan, 2015).

Although the package was created with recommender systems in mind, it can also be used in other domains, e.g. as a faster alternative to LDA (Latent Dirichlet Allocation), where users become documents and items become words.

The package supports parallelization, full-batch variational inference, mini-batch stochastic variational inference (alternating between epochs that sample batches of users and epochs that sample batches of items), and different stopping criteria for the coordinate-ascent procedure. The main computations are written in fast Cython code.

As a point of reference, fitting the model through full-batch updates to the MillionSong TasteProfile dataset (48M records from 1M users on 380K items) took around 45 minutes on a server from Google Cloud with Skylake CPU when using 24 cores.

For a similar package that also uses item/user side information, see ctpfrec.

For a non-Bayesian version which can produce sparse factors see poismf.

Note: this package can also be used from within LensKit, which adds functionalities such as cross-validation and calculation of recommendation quality metrics.

Model description

The model produces a non-negative low-rank matrix factorization of count data (such as the number of times each user played each song in some internet service) Y ~= UV', obtained from the following generative model:

ksi_u ~ Gamma(a_prime, a_prime/b_prime)
Theta_uk ~ Gamma(a, ksi_u)

eta_i ~ Gamma(c_prime, c_prime/d_prime)
Beta_ik ~ Gamma(c, eta_i)

Y_ui ~ Poisson(Theta_u' Beta_i)
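
To make the generative process above concrete, here is a minimal NumPy sketch that draws one synthetic count matrix from it. It assumes the second argument of each Gamma is a rate (NumPy's sampler takes a scale, so the rate is inverted). The sizes and the hyperparameter values are illustrative choices, not the package defaults, and nothing here calls hpfrec itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not package defaults)
n_users, n_items, k = 50, 40, 5
a, a_prime, b_prime = 0.3, 2.0, 1.0
c, c_prime, d_prime = 0.3, 2.0, 1.0

# ksi_u ~ Gamma(a_prime, rate=a_prime/b_prime); NumPy wants scale = 1/rate
ksi = rng.gamma(a_prime, b_prime / a_prime, size=n_users)
# Theta_uk ~ Gamma(a, rate=ksi_u)
theta = rng.gamma(a, 1.0 / ksi[:, None], size=(n_users, k))

# eta_i ~ Gamma(c_prime, rate=c_prime/d_prime)
eta = rng.gamma(c_prime, d_prime / c_prime, size=n_items)
# Beta_ik ~ Gamma(c, rate=eta_i)
beta = rng.gamma(c, 1.0 / eta[:, None], size=(n_items, k))

# Y_ui ~ Poisson(Theta_u' Beta_i)
rates = theta @ beta.T
Y = rng.poisson(rates)

print(Y.shape, "share of non-zero entries:", (Y > 0).mean())
```

With small shape parameters such as 0.3, most factor entries are close to zero, which is what makes the resulting count matrix sparse.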

The parameters are fit using mean-field approximation (a form of Bayesian variational inference) with coordinate ascent (updating each parameter separately until convergence).
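
As a sketch of what "updating each parameter separately" means here, the coordinate-ascent updates can be written as below, following the notation of reference [1] rather than anything in this package's code. The variational parameter names (gamma, lambda, kappa, tau for the shape/rate pairs of Theta, Beta, ksi and eta, and phi for the per-observation allocation over the K factors) are taken from the paper and are assumptions as far as the package internals go. Psi denotes the digamma function, and the sums over i or u only need to visit entries with y_ui > 0 wherever y_ui appears as a factor.

```latex
\begin{aligned}
\phi_{uik} &\propto \exp\left\{ \Psi(\gamma^{shp}_{uk}) - \log \gamma^{rte}_{uk}
                              + \Psi(\lambda^{shp}_{ik}) - \log \lambda^{rte}_{ik} \right\} \\
\gamma^{shp}_{uk} &= a + \sum_i y_{ui}\,\phi_{uik}, \qquad
\gamma^{rte}_{uk} = \frac{\kappa^{shp}_u}{\kappa^{rte}_u} + \sum_i \frac{\lambda^{shp}_{ik}}{\lambda^{rte}_{ik}} \\
\kappa^{shp}_u &= a' + K a, \qquad
\kappa^{rte}_u = \frac{a'}{b'} + \sum_k \frac{\gamma^{shp}_{uk}}{\gamma^{rte}_{uk}} \\
\lambda^{shp}_{ik} &= c + \sum_u y_{ui}\,\phi_{uik}, \qquad
\lambda^{rte}_{ik} = \frac{\tau^{shp}_i}{\tau^{rte}_i} + \sum_u \frac{\gamma^{shp}_{uk}}{\gamma^{rte}_{uk}} \\
\tau^{shp}_i &= c' + K c, \qquad
\tau^{rte}_i = \frac{c'}{d'} + \sum_k \frac{\lambda^{shp}_{ik}}{\lambda^{rte}_{ik}}
\end{aligned}
```

Each line depends only on the current values of the other parameters, so cycling through them repeatedly gives the coordinate-ascent procedure.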

Installation

Note: requires a C compiler configured for Python. See this guide for instructions.

The package is available on PyPI and can be installed with:

pip install hpfrec

Or if that fails:

pip install --no-use-pep517 hpfrec

Note for macOS users: on macOS, the Python version of this package might compile without multi-threading capabilities. In order to enable multi-threading support, first install OpenMP:

brew install libomp

And then reinstall this package: pip install --upgrade --no-deps --force-reinstall hpfrec.

IMPORTANT: the setup script will try to add the compilation flag -march=native. This instructs the compiler to tune the package for the CPU on which it is being installed (e.g. by using AVX instructions if available), but the resulting binary might not be usable on other computers. If building a binary wheel of this package or putting it into a docker image that will be used on different machines, this can be overridden either by (a) defining an environment variable DONT_SET_MARCH=1, or by (b) manually supplying compilation CFLAGS as an environment variable with an architecture-related flag. For maximum compatibility (but slowest speed), it's possible to do something like this:

export DONT_SET_MARCH=1
pip install hpfrec

or, for forcing a maximum-compatibility x86-64 binary:

export CFLAGS="-march=x86-64"
pip install hpfrec

Sample usage

import pandas as pd, numpy as np
from hpfrec import HPF

## Generating sample counts data
nusers = 10**2
nitems = 10**2
nobs   = 10**4

np.random.seed(1)
counts_df = pd.DataFrame({
	'UserId' : np.random.randint(nusers, size=nobs),
	'ItemId' : np.random.randint(nitems, size=nobs),
	'Count' :  (np.random.gamma(1,1, size=nobs) + 1).astype('int32')
	})
counts_df = counts_df.loc[~counts_df[['UserId', 'ItemId']].duplicated()].reset_index(drop=True)

## Initializing the model object
recommender = HPF()

## For stochastic variational inference, need to select batch size (number of users)
recommender = HPF(users_per_batch = 20)

## Full function call
recommender = HPF(
	k=30, a=0.3, a_prime=0.3, b_prime=1.0,
	c=0.3, c_prime=0.3, d_prime=1.0, ncores=-1,
	stop_crit='train-llk', check_every=10, stop_thr=1e-3,
	users_per_batch=None, items_per_batch=None, step_size=lambda x: 1/np.sqrt(x+2),
	maxiter=100, use_float=True, reindex=True, verbose=True,
	random_seed=None, allow_inconsistent_math=False, full_llk=False,
	alloc_full_phi=False, keep_data=True, save_folder=None,
	produce_dicts=True, keep_all_objs=True, sum_exp_trick=False
)

## Fitting the model to the data
recommender.fit(counts_df)

## Fitting the model while monitoring a validation set
recommender = HPF(stop_crit='val-llk')
recommender.fit(counts_df, val_set=counts_df.sample(10**2))
## Note: a real validation set should NEVER be a subset of the training set

## Fitting the model to data in batches passed by the user
recommender = HPF(reindex=False, keep_data=False)
users_batch1 = np.unique(np.random.randint(10**2, size=20))
users_batch2 = np.unique(np.random.randint(10**2, size=20))
users_batch3 = np.unique(np.random.randint(10**2, size=20))
recommender.partial_fit(counts_df.loc[counts_df.UserId.isin(users_batch1)], nusers=10**2, nitems=10**2)
recommender.partial_fit(counts_df.loc[counts_df.UserId.isin(users_batch2)])
recommender.partial_fit(counts_df.loc[counts_df.UserId.isin(users_batch3)])

## Making predictions
# recommender.topN(user=10, n=10, exclude_seen=True) ## not available when using 'partial_fit'
recommender.topN(user=10, n=10, exclude_seen=False, items_pool=np.array([1,2,3,4]))
recommender.predict(user=10, item=11)
recommender.predict(user=[10,10,10], item=[1,2,3])
recommender.predict(user=[10,11,12], item=[4,5,6])

## Evaluating Poisson likelihood
recommender.eval_llk(counts_df, full_llk=True)

## Determining latent factors for a new user, given her item interactions
nobs_new = 20
np.random.seed(2)
counts_df_new = pd.DataFrame({
	'ItemId' : np.random.choice(np.arange(nitems), size=nobs_new, replace=False),
	'Count' : np.random.gamma(1,1, size=nobs_new).astype('int32')
	})
counts_df_new = counts_df_new.loc[counts_df_new.Count > 0].reset_index(drop=True)
recommender.predict_factors(counts_df_new)

## Adding a user without refitting the whole model
recommender.add_user(user_id=nusers+1, counts_df=counts_df_new)

## Updating data for an existing user without refitting the whole model
chosen_user = counts_df.UserId.values[10]
recommender.add_user(user_id=chosen_user, counts_df=counts_df_new, update_existing=True)

If reindex=True is passed, all user and item IDs that you pass to .fit will be reindexed internally (they need to be hashable types like str, int or tuple), and you can use these same IDs to make predictions later. The IDs returned by predict and topN are likewise the IDs that were passed to .fit.
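
To illustrate what reindexing amounts to, the sketch below maps arbitrary hashable IDs to contiguous integer codes with pandas and back again. It only shows the general idea; the column names follow the sample above, the data is made up, and how hpfrec stores its mappings internally is not something this snippet claims to reproduce.

```python
import pandas as pd

# Made-up interactions with string user IDs and non-contiguous item IDs
df = pd.DataFrame({
    "UserId": ["ann", "bob", "ann", "cy"],
    "ItemId": [1007, 1009, 1009, 1007],
    "Count": [3, 1, 2, 5],
})

# factorize assigns codes 0..n-1 in order of first appearance
user_codes, user_ids = pd.factorize(df["UserId"])
item_codes, item_ids = pd.factorize(df["ItemId"])

# Internal integer index -> original ID
print(user_ids[2])    # original ID behind internal user index 2
print(item_ids[1])    # original ID behind internal item index 1

# Original ID -> internal integer index
user_to_index = {uid: idx for idx, uid in enumerate(user_ids)}
print(user_to_index["bob"])
```

With reindex=False, by contrast, the IDs are presumably expected to already be integers of this kind, which is why the partial_fit example above passes nusers and nitems explicitly.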

For a more detailed example, see the IPython notebook "recommending songs with EchoNest MillionSong dataset", which illustrates the package's usage with the EchoNest TasteProfile dataset.
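
The sample above draws its validation set from the training data only for brevity. A disjoint split could be sketched as follows with pandas alone. The 80/20 ratio, the synthetic data and the extra filtering step are illustrative assumptions: the filter keeps only validation rows whose user and item both still appear in the training part, on the reasoning that a model has no factors for users or items it never saw.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
nobs = 2000

# Synthetic counts with the same column names as the sample usage
counts_df = pd.DataFrame({
    "UserId": rng.integers(50, size=nobs),
    "ItemId": rng.integers(40, size=nobs),
    "Count": rng.poisson(1.0, size=nobs) + 1,
}).drop_duplicates(["UserId", "ItemId"]).reset_index(drop=True)

# Hold out 20% of the rows; the rest is the training set
val_df = counts_df.sample(frac=0.2, random_state=1)
train_df = counts_df.drop(val_df.index)

# Drop validation rows that refer to users or items absent from training
keep = val_df["UserId"].isin(train_df["UserId"]) & val_df["ItemId"].isin(train_df["ItemId"])
val_df = val_df.loc[keep].reset_index(drop=True)
train_df = train_df.reset_index(drop=True)

print(len(train_df), len(val_df))
```

The two frames could then be passed as recommender.fit(train_df, val_set=val_df), mirroring the call in the sample usage.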

Documentation

Documentation is available at readthedocs: http://hpfrec.readthedocs.io

It is also internally documented through docstrings (e.g. you can try help(hpfrec.HPF), help(hpfrec.HPF.fit), etc.).

Serializing (pickling) the model

Don't use pickle to save an HPF object, as it will fail due to problems with lambda functions. Rather, use dill instead, which has the same syntax as pickle:

import dill
from hpfrec import HPF

h = HPF()
## Context managers make sure the file handles are closed
with open("HPF_obj.dill", "wb") as f:
    dill.dump(h, f)
with open("HPF_obj.dill", "rb") as f:
    h = dill.load(f)
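
The lambda problem mentioned above can be reproduced with the standard library alone. The sketch below does not touch hpfrec: it only shows that pickle refuses a lambda like the step_size one in the sample usage, while a function that can be imported by name round-trips fine. The variable names are illustrative.

```python
import math
import pickle

# A lambda similar to the step_size argument in the sample usage
step_size = lambda x: 1 / math.sqrt(x + 2)

try:
    pickle.dumps(step_size)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError) as err:
    # pickle stores functions by qualified name, and a lambda has none to look up
    lambda_pickled = False
    print("pickle refused the lambda:", type(err).__name__)

# A function that is importable by name is stored as a reference and restores fine
restored = pickle.loads(pickle.dumps(math.sqrt))
print(restored(4.0))
```

dill serializes the function body itself instead of a name reference, which is why it can handle objects that hold lambdas.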

Speeding up optimization procedure

For faster fitting and predictions, use SciPy and NumPy libraries compiled against MKL or OpenBLAS. These come by default with MKL in Anaconda installations.

The constructor for HPF allows some parameters to make it run faster (if you know what you're doing): these are allow_inconsistent_math=True, full_llk=False, stop_crit='diff-norm', reindex=False, verbose=False. See the documentation for more details.

Using stochastic variational inference, which fits the data in smaller batches containing all the user-item interactions only for subsets of users, might converge in fewer iterations (epochs), but the results tend to be slightly worse.
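
For intuition on the step_size argument used with stochastic variational inference, the sketch below evaluates the schedule from the sample usage and applies it in the usual stochastic-approximation form from reference [3], where each new noisy estimate is blended into the running value. Whether the package starts counting iterations at zero, and exactly which quantities it blends, are assumptions here; the target vector and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Schedule from the sample usage: decreasing, but slowly enough to keep learning
step_size = lambda x: 1 / np.sqrt(x + 2)
first_steps = np.array([step_size(t) for t in range(5)])
print(first_steps)

# Blend noisy batch estimates into a running parameter value
target = np.array([1.0, 2.0, 3.0])   # what a full-batch update would give (made up)
param = np.zeros(3)
for t in range(200):
    noisy_estimate = target + rng.normal(scale=0.5, size=3)
    rho = step_size(t)
    param = (1 - rho) * param + rho * noisy_estimate

print(param)
```

Early iterations move the parameters a lot, and later ones average out the batch noise, which is consistent with the slightly noisier final result described above.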

References

[1] Gopalan, Prem, Jake M. Hofman, and David M. Blei. "Scalable Recommendation with Hierarchical Poisson Factorization." UAI. 2015.
[2] Gopalan, Prem, Jake M. Hofman, and David M. Blei. "Scalable recommendation with poisson factorization." arXiv preprint arXiv:1311.1704 (2013).
[3] Hoffman, Matthew D., et al. "Stochastic variational inference." The Journal of Machine Learning Research 14.1 (2013): 1303-1347.

hpfrec's People

Forkers

xuetf abs51295 mdobson-cs mindis pombredanne yetanotherion lucky-suman silence28 cafew mldeveloper01 seeker1943 fagan2888 huyang-pku zshwuhan echo-valor lorry-ruiluo irisdodo shawnsu007 zilongxie

hpfrec's Issues

hpfrec on Python 3.8 vs Python 3.9

Hey David, hope your day is going well. We encountered an issue when pip installing hpfrec on Python 3.9. On Python 3.8, the package works great. We made a video here so that you can take a look. Cheers!

video1440725217.mp4

The Example code given in the README throws a NameError

When I try to run the example code in the readme, bar the second and third training calls, like:

import pandas as pd, numpy as np
from hpfrec import HPF

## Generating sample counts data
nusers = 10**2
nitems = 10**2
nobs = 10**4

np.random.seed(1)
counts_df = pd.DataFrame({
	'UserId' : np.random.randint(nusers, size=nobs),
	'ItemId' : np.random.randint(nitems, size=nobs),
	'Count' : np.random.gamma(1,1, size=nobs).astype('int32')
	})
counts_df = counts_df.loc[counts_df.Count > 0].reset_index(drop=True)

## Initializing the model object
recommender = HPF()

## For stochastic variational inference, need to select batch size (number of users)
recommender = HPF(users_per_batch = 20)

## Full function call
recommender = HPF(
	k=30, a=0.3, a_prime=0.3, b_prime=1.0,
	c=0.3, c_prime=0.3, d_prime=1.0, ncores=-1,
	stop_crit='train-llk', check_every=10, stop_thr=1e-3,
	users_per_batch=None, items_per_batch=None, step_size=lambda x: 1/np.sqrt(x+2),
	maxiter=100, reindex=True, verbose=True,
	random_seed = None, allow_inconsistent_math=False, full_llk=False,
	alloc_full_phi=False, keep_data=True, save_folder=None,
	produce_dicts=True, keep_all_objs=True, sum_exp_trick=False
)

## Fitting the model to the data
recommender.fit(counts_df)

## Fitting the model while monitoring a validation set
# recommender = HPF(stop_crit='val-llk')
# recommender.fit(counts_df, val_set=counts_df.sample(10**2))
## Note: a real validation should NEVER be a subset of the training set

## Fitting the model to data in batches passed by the user
# recommender = HPF(reindex=False, keep_data=False)
# users_batch1 = np.unique(np.random.randint(10**2, size=20))
# users_batch2 = np.unique(np.random.randint(10**2, size=20))
# users_batch3 = np.unique(np.random.randint(10**2, size=20))
# recommender.partial_fit(counts_df.loc[counts_df.UserId.isin(users_batch1)], nusers=10**2, nitems=10**2)
# recommender.partial_fit(counts_df.loc[counts_df.UserId.isin(users_batch2)])
# recommender.partial_fit(counts_df.loc[counts_df.UserId.isin(users_batch3)])

## Making predictions
recommender.topN(user=10, n=10, exclude_seen=True)
recommender.topN(user=10, n=10, exclude_seen=False, items_pool=np.array([1,2,3,4]))
recommender.predict(user=10, item=11)
recommender.predict(user=[10,10,10], item=[1,2,3])
recommender.predict(user=[10,11,12], item=[4,5,6])

## Evaluating Poisson likelihood
recommender.eval_llk(counts_df, full_llk=True)

## Determining latent factors for a new user, given her item interactions
nobs_new = 20
np.random.seed(2)
counts_df_new = pd.DataFrame({
	'ItemId' : np.random.choice(np.arange(nitems), size=nobs_new, replace=False),
	'Count' : np.random.gamma(1,1, size=nobs_new).astype('int32')
	})
counts_df_new = counts_df_new.loc[counts_df_new.Count > 0].reset_index(drop=True)
recommender.predict_factors(counts_df_new)

## Adding a user without refitting the whole model
recommender.add_user(user_id=nusers+1, counts_df=counts_df_new)

## Updating data for an existing user without refitting the whole model
chosen_user = counts_df.UserId.values[10]
recommender.add_user(user_id=chosen_user, counts_df=counts_df_new, update_existing=True)

I get the error below:

Traceback (most recent call last):
  File "/Users/avgupta/.pyenv/versions/hpfrec/lib/python3.6/site-packages/hpfrec/__init__.py", line 635, in _process_data_single
    counts_df['ItemId'] = counts_df.ItemId.map(lambda x: self.item_dict_[user])
  File "/Users/avgupta/.pyenv/versions/hpfrec/lib/python3.6/site-packages/pandas/core/series.py", line 2998, in map
    arg, na_action=na_action)
  File "/Users/avgupta/.pyenv/versions/hpfrec/lib/python3.6/site-packages/pandas/core/base.py", line 1004, in _map_values
    new_values = map_f(values, mapper)
  File "pandas/_libs/src/inference.pyx", line 1472, in pandas._libs.lib.map_infer
  File "/Users/avgupta/.pyenv/versions/hpfrec/lib/python3.6/site-packages/hpfrec/__init__.py", line 635, in <lambda>
    counts_df['ItemId'] = counts_df.ItemId.map(lambda x: self.item_dict_[user])
NameError: name 'user' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_hpf.py", line 70, in <module>
    recommender.predict_factors(counts_df_new)
  File "/Users/avgupta/.pyenv/versions/hpfrec/lib/python3.6/site-packages/hpfrec/__init__.py", line 960, in predict_factors
    counts_df = self._process_data_single(counts_df)
  File "/Users/avgupta/.pyenv/versions/hpfrec/lib/python3.6/site-packages/hpfrec/__init__.py", line 637, in _process_data_single
    raise ValueError("Can only make calculations for items that were in the training set.")
ValueError: Can only make calculations for items that were in the training set.

My environment info:

Cython==0.28.5
hpfrec==0.2.2.1
numpy==1.15.1
pandas==0.23.4
python-dateutil==2.7.3
pytz==2018.5
scipy==1.1.0
six==1.11.0

Finding the top items in each topic

I am trying to find the top items which appear in each of the $K$ topics. From the documentation I have assumed that I must call

recommender = HPF(k=10, a=0.3, a_prime=0.3, b_prime=1.0,
                 c=0.3, c_prime=0.3, d_prime=1.0, ncores=-1,
                 stop_crit='val-llk', check_every=10, stop_thr=1e-3,
                 maxiter=150, reindex=True, random_seed = 123,
                 allow_inconsistent_math=False, verbose=True, full_llk=False,
                 keep_data=True, save_folder=None, produce_dicts=True)
recommender.fit(train, val_set = validation)

recommender.item_dict

however the recommender.item_dict outputs:

AttributeError                            Traceback (most recent call last)
<ipython-input-24-05d52252689c> in <module>()
----> 1 recommender.item_dict

AttributeError: 'HPF' object has no attribute 'item_dict'

Have I missed something when I call HPF?

Comparing models for hyperparameter optimization

I tried to optimize the number of components by using a proper train/test split with fit(val_set) and running eval_llk at the end. However, I got the error "'input_df' has no combinations of users and itemsin common with the training set." regardless of whether I ran eval_llk on the training, test, or complete data set.

Not compatible with the latest version of Python

I was told that hpfrec is not compatible with Python 3.9 and 3.10. Can this please be fixed?
Thank you, Esfandiar

How to get confidence score for recommended packages?

Hey @david-cortes ,

I can see the package numbers in recommendation. Along with that, can I see the confidence score of the recommended packages as well? Based on my understanding of the code so far, I think it's the same score that's used to sort the packages in topN() function. But, I am unable to figure out as to how can I get that score back. I guess it will be a nice metric to know. Let me know if you have any ideas and if I can send a PR to include that in the library. Thanks.
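
As a rough illustration of the score in question: under the model described earlier, the expected count for a user-item pair is the dot product Theta_u' Beta_i, so ranking items by that product and keeping the values gives both the top-N list and its scores. The sketch below uses random stand-in factor matrices rather than a fitted model, and the variable names are illustrative. The result is an expected count, not a calibrated confidence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in factors (a fitted model would supply these instead)
theta = rng.gamma(0.3, 1.0, size=(20, 5))   # users x factors
beta = rng.gamma(0.3, 1.0, size=(30, 5))    # items x factors

user = 10
n = 5

# Expected count of every item for this user
scores = beta @ theta[user]

# Indices of the n highest-scoring items, best first, with their scores
top_items = np.argsort(-scores)[:n]
top_scores = scores[top_items]

for item, score in zip(top_items, top_scores):
    print(item, round(float(score), 4))
```

With a fitted recommender, calling predict with the same user repeated and a list of candidate items, as in the sample usage, could presumably serve the same purpose.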

Distinguishing 0s from missing data

Hello David,

Thank you very much for sharing this module. It's been really useful.

I am trying to fit a model on a dataframe D with the option to optimize the likelihood in a validation set. The way I am preparing the validation set is by setting apart a random subset of rows of D, call this subset D'. Then, my training set is D - D', and my validation set is D'.

Is this the intended way to use the package? To me, it seems that, by removing rows from the training set, we are implicitly stating that these entries are 0 in the user-item matrix. I would expect that we have to modify the likelihood calculation to account for the fact that D' is missing data, not 0s. I read the code, but I couldn't find any kind of accounting.

As a follow-up question, wouldn't it be desirable to have both zero AND non-zero entries in the validation set? By not allowing non-zero entries, aren't we biasing the inference?

Thank you very much for reading!

Best,

Jose

Unable to install hpfrec

I am trying to pip install hpfrec, but I am presented with "ERROR: Could not build wheels for hpfrec which use PEP 517 and cannot be installed directly".

I have attempted to solve the issue with two stack overflow results:

https://stackoverflow.com/questions/55962678/installing-chatterbot-but-getting-error-could-not-build-wheels-for-spacy-which

https://stackoverflow.com/questions/59441794/error-could-not-build-wheels-for-cryptography-which-use-pep-517-and-cannot-be-i

however neither fix the problem.

Recovering model from save_folder files

Hey David,

Thank you for the code. We have a fairly large dataset, and I am interested in saving the recommender model to a file so I can avoid training over and over again (we don't want to call recommender.fit() repetitively). We put a google drive path to the save_folder field and are able to see the files. Is there a way we can recover the model using these files?

Items-items recommendation and precomputing factorization once.

Hi David:
I have a question about speeding up the calculations by doing the factorization once.
Let's assume we already have a (large) users-items matrix or data frame. Is there a way to precompute an items-items prediction matrix so that, given a set of items (and their counts) that a new user inputs, we can suggest the topN items?
For instance, if we had an items x latent_Item_Values matrix and multiplied it on the right by its own transpose, would that matrix (or the square root of its elements) give us an items-items matrix? The idea is to create an items-items matrix that we can multiply with the vector of new counts from the new user to find the topN.
The goal is to avoid appending the new user-items to the end of the data, and avoid factorizing the entire data again, to save some time and computation.
Thanks in advance, Esfandiar

Persistent Hanging of hpfrec.HPF().fit() in Production ML Pipeline

I'll preface this issue by saying that I know I do not have enough information to expect a solution, but I figured I'd raise this concern in case anyone else has been troubleshooting a similar problem.

Context
Our production ML pipeline that relies on this package has been operational without any issues for the past year. However, starting from July 17th, 2023, the pipeline has been persistently hanging when calling the hpfrec.HPF().fit() function, leading to a significant increase in job processing time from 1 hour to over 50 hours (without completion).

Symptoms
The verbose messaging displays as follows during the hang:

Number of users: 820137
Number of items: 472
Latent factors to use: 50

Initializing parameters...
Allocating Phi matrix...
Initializing optimization procedure...

This problem occurs even after ensuring that the data fed into the model hasn't been altered, and is reproduced when rolling back the data to previous dates.

Environment Details

Databricks cluster specs:
Driver: r4.8xlarge
Workers: r4.8xlarge
2-8 workers
7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)
Python 3.7
Located in us-east-1c
Package version: hpfrec==0.2.3.1

Code Snippet
Here is the code snippet for initializing and fitting the model:

model = hpfrec.HPF(
    k=50,
    full_llk=False,
    random_seed=123,
    check_every=10,
    maxiter=150,
    reindex=True,
    allow_inconsistent_math=True,
    ncores=-1,
    stop_crit="diff-norm",
    verbose=True,
)
model.fit(input_data)

We are unsure of the cause of this issue and are eager to troubleshoot this to maintain the efficiency of our production pipeline. Any insights or advice would be highly appreciated.

Bug - add_user() only works with int ids, our count_df has UserId and ItemId as strings.

Hey David, good job on the code. I'm experiencing a bug with add_user(), and I'm hoping you can take a look at this video.

zoom_1.mp4