david-cortes / ctpfrec
Python implementation of "Content-based recommendations with poisson factorization", with some extensions
License: BSD 2-Clause "Simplified" License
I've noticed that when training, the ctpfrec model reports training error, unlike hpfrec, which reports validation error. This is problematic for model selection: training error keeps decreasing as model complexity grows, so selecting a model on it leads to overfitting. Do you have any advice on how I can obtain the validation error?
Thank you
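One way to compare models on held-out data, independent of what the library prints during fitting, is to score predictions on the validation triplets yourself. The sketch below computes a mean Poisson log-likelihood with plain NumPy. It assumes you already have a predicted rate for each held-out (user, item) pair; the function name `mean_poisson_loglik` and the toy numbers are illustrative and not part of ctpfrec.

```python
import math
import numpy as np

def mean_poisson_loglik(counts, rates, eps=1e-10):
    """Average Poisson log-likelihood of observed counts under predicted rates."""
    counts = np.asarray(counts, dtype=float)
    rates = np.clip(np.asarray(rates, dtype=float), eps, None)  # avoid log(0)
    log_factorial = np.array([math.lgamma(c + 1.0) for c in counts])
    return float(np.mean(counts * np.log(rates) - rates - log_factorial))

# Toy held-out counts and hypothetical predicted rates for the same pairs
val_counts = [0, 2]
val_rates = [1.0, 2.0]
val_score = mean_poisson_loglik(val_counts, val_rates)
print(val_score)  # higher (closer to zero) is better
```

Tracking this number across hyperparameter settings gives a selection criterion that does not improve automatically with model complexity.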
I am trying to restrict the set of items ctpfrec recommends. Each of my items is uniquely identified by a string, e.g. '48069855'.
I have tried the following, but both calls result in an error being thrown:
recommender.topN(user = -1, n=5, exclude_seen = True, items_pool=user_counts_test.ItemId.unique())
or
recommender.topN(user = -1, n=5, exclude_seen = True, items_pool=np.array(['48069855', '47994812', '47994813', '47811334', '47809545','47770950']) )
I'm presented with:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/home/research/jackmck/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
57 try:
---> 58 return bound(*args, **kwds)
59 except TypeError:
TypeError: Partition index must be integer
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-94-c4d8742971d3> in <module>()
5 # new_user_count = pd.DataFrame({'UserId': -1,'ItemId': ['48028651','48065053','48057353'],'Count': [1,1,1]})
6 # recommender.add_users(new_user_count)
----> 7 recommender.topN(user = -1, n=5, exclude_seen = True, items_pool=np.array(['48069855', '47994812', '47994813', '47811334', '47809545','47770950']) ) # think about excluding seen
/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in topN(self, user, n, exclude_seen, items_pool)
1300 raise Exception("Can only exclude seen items when passing 'keep_data=True' to .fit")
1301
-> 1302 return self._topN(self._M1[user], n, exclude_seen, items_pool, user)
1303
1304 def topN_cold(self, user_df, n=10, items_pool=None, maxiter=10, ncores=1, random_seed=1, stop_thr=1e-3):
/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in _topN(self, user_vec, n, exclude_seen, items_pool, user)
1245 if exclude_seen:
1246 n_ext = np.min([n + self._n_seen_by_user[user], items_pool.shape[0]])
-> 1247 rec = np.argpartition(allpreds, n_ext-1)[:n_ext]
1248 seen = self.seen[self._st_ix_user[user] : self._st_ix_user[user] + self._n_seen_by_user[user]]
1249 if self.reindex:
<__array_function__ internals> in argpartition(*args, **kwargs)
/home/research/jackmck/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py in argpartition(a, kth, axis, kind, order)
830
831 """
--> 832 return _wrapfunc(a, 'argpartition', kth, axis=axis, kind=kind, order=order)
833
834
/home/research/jackmck/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
65 # Call _wrapit from within the except clause to ensure a potential
66 # exception has a traceback chain.
---> 67 return _wrapit(obj, method, *args, **kwds)
68
69
/home/research/jackmck/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
42 except AttributeError:
43 wrap = None
---> 44 result = getattr(asarray(obj), method)(*args, **kwds)
45 if wrap:
46 if not isinstance(result, mu.ndarray):
TypeError: Partition index must be integer
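The message "Partition index must be integer" comes from NumPy itself: np.argpartition rejects a floating-point kth. Below is a minimal sketch of that behaviour using made-up scores. Whether n_ext actually ends up as a float inside ctpfrec for this call is an assumption, not something the traceback confirms.

```python
import numpy as np

scores = np.array([0.3, 0.1, 0.9, 0.5])
kth_float = np.float64(2.0)   # e.g. the result of np.min over a mix of ints and floats

try:
    np.argpartition(scores, kth_float)
    float_kth_failed = False
except TypeError as err:
    float_kth_failed = True
    print("TypeError:", err)

# Casting kth to a Python int makes the same call valid
lowest_two = np.argpartition(scores, int(kth_float) - 1)[:2]
print(sorted(lowest_two.tolist()))
```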
If I pass the ItemIds as integers rather than in their original string format:
recommender.topN(user = -1, n=5, exclude_seen = True, items_pool=np.array([48069855, 47994812, 47994813, 4781133, 47809545, 47770950]) )
I am presented with the error:
ValueError Traceback (most recent call last)
<ipython-input-97-6594e242d050> in <module>()
5 # new_user_count = pd.DataFrame({'UserId': -1,'ItemId': ['48028651','48065053','48057353'],'Count': [1,1,1]})
6 # recommender.add_users(new_user_count)
----> 7 recommender.topN(user = -1, n=5, exclude_seen = True, items_pool=np.array([48069855, 47994812, 47994813, 4781133, 47809545, 47770950]) ) # think about excluding seen
/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in topN(self, user, n, exclude_seen, items_pool)
1300 raise Exception("Can only exclude seen items when passing 'keep_data=True' to .fit")
1301
-> 1302 return self._topN(self._M1[user], n, exclude_seen, items_pool, user)
1303
1304 def topN_cold(self, user_df, n=10, items_pool=None, maxiter=10, ncores=1, random_seed=1, stop_thr=1e-3):
/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in _topN(self, user_vec, n, exclude_seen, items_pool, user)
1230 del nan_ix
1231 if items_pool_reind.shape[0] == 0:
-> 1232 raise ValueError("No items to recommend.")
1233 elif items_pool_reind.shape[0] == 1:
1234 raise ValueError("Only 1 item to recommend.")
ValueError: No items to recommend.
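"No items to recommend" is raised when items_pool_reind is empty, which suggests that none of the pool IDs matched the IDs the model was fitted on. A quick check, sketched here with plain pandas and NumPy on made-up data (the names `train_counts` and `pool` are hypothetical), is to compare the pool against the training ItemId column after bringing both to the same dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical training triplets with string item IDs
train_counts = pd.DataFrame({
    "UserId": [0, 0, 1],
    "ItemId": ["48069855", "47994812", "47811334"],
    "Count": [1, 2, 1],
})
pool = np.array([48069855, 47994812, 4781133])   # integer IDs, one not in training

known_ids = np.unique(train_counts["ItemId"].to_numpy().astype(str))
pool_as_str = pool.astype(str)                   # match the dtype used at fit time
in_training = np.isin(pool_as_str, known_ids)

usable_pool = pool_as_str[in_training]
unknown_pool = pool_as_str[~in_training]
print("usable:", usable_pool.tolist())
print("unknown to the model:", unknown_pool.tolist())
```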
It appears that ctpfrec is unable to make out-of-matrix prediction, i.e. it can't recommend items without any ratings/clicks/plays/etc.
You asked me to upload a toy dataset that reproduces the problem, which I am having trouble doing. I am also unable to upload the datasets I am using due to GDPR.
The setup is very simple, however: I have three sets of user click data, user_counts_train, user_counts_validation and user_counts_test (each in the required pandas triplet form {"UserId" : , "ItemId" : , "Count" : }), and another set, word_counts, for the items (in the required pandas triplet form {"ItemId" : , "WordId" : , "Count" : }).
Importantly, there are no items in the three user sets that aren't in the word_counts set.
I fit my model using the training and validation sets:
recommender.fit(counts_df=user_counts_train, words_df=word_counts, val_set=user_counts_validation)
The issue arises when I attempt to make an out-of-matrix prediction using an item that appears only in the user_counts_test and word_counts sets, via:
new_user_count = pd.DataFrame({'UserId': 1.,'ItemId': [48081576,48081576,48081576],'Count': [1,1,2]}) # user clicks on item not in the training or validation sets
recommender4.add_users(new_user_count) # add the new user's click history to recommender4
recs = recommender4.topN(user = 1, n=k, exclude_seen = False) # output top k recommendations
Is the issue with ctpfrec itself, or with the way I am attempting to add a new user history and make predictions with topN?
Thank you
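To narrow down whether the problem is that the item is absent from the interaction data used for fitting, it can help to list which test items occur only in word_counts. A minimal pandas sketch on toy frames that reuse the variable names from the post (the data itself is invented):

```python
import pandas as pd

user_counts_train = pd.DataFrame({"UserId": [0, 1], "ItemId": [10, 11], "Count": [1, 3]})
user_counts_validation = pd.DataFrame({"UserId": [0], "ItemId": [11], "Count": [1]})
user_counts_test = pd.DataFrame({"UserId": [1, 1], "ItemId": [10, 12], "Count": [2, 1]})
word_counts = pd.DataFrame({"ItemId": [10, 11, 12], "WordId": [0, 1, 0], "Count": [4, 2, 1]})

# Items that had at least one interaction when the model was fitted
fitted_items = set(user_counts_train["ItemId"]) | set(user_counts_validation["ItemId"])
test_items = set(user_counts_test["ItemId"])

# Items with text but no clicks at fit time: the out-of-matrix cases
out_of_matrix = sorted((test_items - fitted_items) & set(word_counts["ItemId"]))
# Test items missing from word_counts entirely (should be empty per the post)
no_text = sorted(test_items - set(word_counts["ItemId"]))
print(out_of_matrix, no_text)
```

If the failing item shows up in `out_of_matrix`, the question becomes how the library treats items it has only seen through their words, which is something the maintainer would need to confirm.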
When trying to add new users to a previously trained model, NumPy complains with:
1813 if self.keep_data and (counts_df is not None):
-> 1814 for u in range(new_max_id):
1815 items_this_user = counts_df.ItemId.values[counts_df.UserId == u]
TypeError: 'numpy.float64' object cannot be interpreted as an integer
It seems that new_max_id is cast to a NumPy float somewhere along the way, even when the counts and words dataframes only contain integer identifiers.
Minimal code block to reproduce the error:
import numpy as np
import pandas as pd
from ctpfrec import CTPF
# Dummy data
counts_df = pd.DataFrame([[0,0,1],[0,1,1]], columns = ['UserId','ItemId','Count'])
words_df = pd.DataFrame([[0,0,1],[0,1,1]], columns = ['ItemId','WordId','Count'])
# Fit model
recommender = CTPF(k = 5,
                   reindex = True)
recommender.fit(counts_df = counts_df,
                words_df = words_df)
# Generate new dummy user
counts_df_new = pd.DataFrame([[1,0,1],[1,1,1]], columns = ['UserId','ItemId','Count'])
# Add new dummy user  <-- this breaks
recommender.add_users(counts_df = counts_df_new)
I suspected this might have something to do with the re-indexing, but when I disable this option (reindex = False) when instantiating the CTPF object, the fit call fails with:
--> 896 items_intersect = np.in1d(items_words_df, items_counts_df)
897 words_include = self._words_df.WordId.loc[np.in1d(self._words_df.ItemId, items_words_df[items_intersect])].unique()
NameError: name 'items_counts_df' is not defined
My NumPy version is 1.18.1.
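The TypeError itself is easy to reproduce outside ctpfrec: range() refuses a numpy.float64. One common way an integer ID column silently becomes float64 in pandas is the introduction of missing values, for example through a reindex or a merge. The sketch below shows that mechanism on toy data. That this is what happens to new_max_id inside add_users is an assumption.

```python
import pandas as pd

user_ids = pd.Series([0, 1], dtype="int64")

# Reindexing onto a label with no value introduces NaN and upcasts to float64
upcast = user_ids.reindex([0, 1, 2])
max_id = upcast.max() + 1          # numpy.float64, not an int
print(type(max_id).__name__, max_id)

try:
    range(max_id)
    range_failed = False
except TypeError as err:
    range_failed = True
    print("TypeError:", err)

# An explicit cast restores the expected behaviour
fixed_ids = list(range(int(max_id)))
print(fixed_ids)
```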
I am trying to add new articles to the recommender class using recommender.add_items(word_counts_test), but I am presented with the error message "ValueError: Categorical categories must be unique". Can you please explain what this means exactly? My pandas data frame word_counts_test is in the required form of columns={"ItemId":,"WordId":, "Count":}.
Surely all three columns will contain non-unique values, as each article contains more than a single word and words appear in multiple articles?
Thank you
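The uniqueness requirement in that message applies to the list of category labels, not to the values in a column. pandas raises it when a Categorical is built with a categories list that contains duplicates, as the sketch below shows on toy IDs. One plausible trigger, stated as an assumption rather than a diagnosis, is passing items to add_items that the model already knows, so that old and new IDs end up concatenated into one category list.

```python
import pandas as pd

existing_items = ["a", "b"]
new_items = ["b", "c"]            # "b" overlaps with the existing IDs

values = ["a", "b", "b", "c"]     # repeated values are fine

try:
    pd.Categorical(values, categories=existing_items + new_items)
    dup_categories_failed = False
except ValueError as err:
    dup_categories_failed = True
    print("ValueError:", err)

# De-duplicating the category list (order preserved) avoids the error
unique_categories = list(dict.fromkeys(existing_items + new_items))
codes = pd.Categorical(values, categories=unique_categories).codes
print(codes.tolist())
```

Checking whether any ItemId in word_counts_test already appears in the data used for fitting would confirm or rule out this explanation.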