YWBAT
- condition data for a recommender system
- apply cosine similarity to recommend jokes
- describe the pros and cons of using cosine similarity
- Cosine similarity measures the angle between two vectors
- if cosine(v1, v2) == 0 -> perpendicular
- if cosine(v1, v2) == 1 -> same direction
- if cosine(v1, v2) == -1 -> opposite direction
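A quick sketch of these three reference cases using sklearn's `cosine_similarity` (the vectors here are chosen purely for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

v = np.array([[1.0, 0.0]])
perpendicular = np.array([[0.0, 1.0]])   # 90 degrees apart
same_direction = np.array([[3.0, 0.0]])  # same direction, different length
opposite = np.array([[-2.0, 0.0]])       # 180 degrees apart

print(cosine_similarity(v, perpendicular))   # [[0.]]
print(cosine_similarity(v, same_direction))  # [[1.]]
print(cosine_similarity(v, opposite))        # [[-1.]]
```

Note that length doesn't matter: `[3, 0]` still scores 1 against `[1, 0]`, since only the angle is compared.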
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances
import matplotlib.pyplot as plt
import seaborn as sns
Format:
- Data files are in .zip format; when unzipped, they are in Excel (.xls) format
- Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").
- One row per user
- The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.
- The sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense. Almost all users have rated those jokes (see discussion of "universal queries" in the above paper).
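Since "99" encodes "not rated", one way to sanity-check the layout is to treat those entries as missing and recount. A minimal sketch on a toy frame in the same layout (the toy values here are made up, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame in the Jester layout: column 0 = number of jokes rated,
# remaining columns = ratings, 99 = not rated.
toy = pd.DataFrame({0: [2, 1],
                    1: [4.5, 99.0],
                    2: [-3.2, 99.0],
                    3: [99.0, 7.1]})

ratings = toy.drop(0, axis=1).replace(99.0, np.nan)
print(ratings.notna().sum(axis=1))  # should match column 0: 2 and 1
```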
df = pd.read_excel("./data/jester-data-1.xls", header=None)
print(df.shape)
df.head()
(24983, 101)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 74 | -7.82 | 8.79 | -9.66 | -8.16 | -7.52 | -8.50 | -9.85 | 4.17 | -8.98 | ... | 2.82 | 99.00 | 99.00 | 99.00 | 99.00 | 99.00 | -5.63 | 99.00 | 99.00 | 99.00 |
1 | 100 | 4.08 | -0.29 | 6.36 | 4.37 | -2.38 | -9.66 | -0.73 | -5.34 | 8.88 | ... | 2.82 | -4.95 | -0.29 | 7.86 | -0.19 | -2.14 | 3.06 | 0.34 | -4.32 | 1.07 |
2 | 49 | 99.00 | 99.00 | 99.00 | 99.00 | 9.03 | 9.27 | 9.03 | 9.27 | 99.00 | ... | 99.00 | 99.00 | 99.00 | 9.08 | 99.00 | 99.00 | 99.00 | 99.00 | 99.00 | 99.00 |
3 | 48 | 99.00 | 8.35 | 99.00 | 99.00 | 1.80 | 8.16 | -2.82 | 6.21 | 99.00 | ... | 99.00 | 99.00 | 99.00 | 0.53 | 99.00 | 99.00 | 99.00 | 99.00 | 99.00 | 99.00 |
4 | 91 | 8.50 | 4.61 | -4.17 | -5.39 | 1.36 | 1.60 | 7.04 | 4.61 | -0.44 | ... | 5.19 | 5.58 | 4.27 | 5.19 | 5.73 | 1.55 | 3.11 | 6.55 | 1.80 | 1.60 |
5 rows × 101 columns
v1 = np.array([1, 2])
v2 = np.array([1, 2.5])
cosine_similarity(v1.reshape(1, -1), v2.reshape(1, -1)), cosine_distances(v1.reshape(1, -1), v2.reshape(1, -1))
(array([[0.99654576]]), array([[0.00345424]]))
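As the two numbers above suggest, sklearn defines cosine distance as 1 minus cosine similarity, so the pair always sums to 1. A quick check:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

v1 = np.array([[1.0, 2.0]])
v2 = np.array([[1.0, 2.5]])

sim = cosine_similarity(v1, v2)[0, 0]
dist = cosine_distances(v1, v2)[0, 0]
print(sim + dist)  # ~1.0, since distance = 1 - similarity
```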
- How do we recommend a joke to userA?
- user to user ->
- find users that are similar to userA
- from those similar users, recommend jokes they rated highly that userA has not yet rated
# option 1: replace the 99s with 0s -- but 0 is a valid rating on the scale...
# option 2: add 11 to every rating so the scale runs 1 to 21 with no negatives --
#           but shifting every vector changes the angles between them, which
#           distorts cosine similarity...
# so for now, let's leave the ratings as they are
# build a flow for a given user then turn this into a function
user_index = 0
userA = df.drop(0, axis=1).loc[user_index, :]
# let's get the other users
others = df.drop(0, axis=1).drop(index=user_index, axis=0)
# let's find the nearest neighbors
knn = NearestNeighbors(n_neighbors=5, metric='cosine', n_jobs=-1)
knn.fit(others)
NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
metric_params=None, n_jobs=-1, n_neighbors=5, p=2, radius=1.0)
distances, indices = knn.kneighbors(userA.values.reshape(1, -1))
distances, indices = distances[0], indices[0]
distances, indices
(array([0.02494242, 0.03028924, 0.0435472 , 0.04501014, 0.04511571]),
array([22358, 2255, 3509, 5175, 8767]))
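With `metric='cosine'`, the distances returned by `kneighbors` are cosine distances (1 minus cosine similarity), so smaller means more similar. A toy sketch of that behavior (data made up for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1.0, 0.0],   # row 0: same direction as the query
              [0.0, 1.0],   # row 1: perpendicular to the query
              [1.0, 1.0]])  # row 2: 45 degrees from the query
knn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(X)
distances, indices = knn.kneighbors(np.array([[2.0, 0.0]]))
print(indices[0])    # nearest first: row 0, then row 2, then row 1
print(distances[0])  # roughly 0.0, 0.29, 1.0
```

One caveat worth remembering: the indices are positions in the fitted array, not labels from a dataframe's index.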
# let's get jokes not rated by userA
jokes_not_rated = np.where(userA==99)[0]
jokes_not_rated
array([70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88,
89, 91, 92, 93, 94, 95, 97, 98, 99])
user_jokes = df.drop(0, axis=1).loc[indices, jokes_not_rated].T.replace(99, 0)
user_jokes['total'] = user_jokes.sum(axis=1)  # total neighbor rating per joke
user_jokes.head()
22358 | 2255 | 3509 | 5175 | 8767 | total | |
---|---|---|---|---|---|---|
70 | -3.88 | -0.53 | 0.0 | -5.97 | 4.17 | -6.21 |
71 | -9.22 | -4.47 | 0.0 | 0.00 | 0.00 | -13.69 |
72 | -1.17 | 7.82 | 0.0 | 3.20 | 0.00 | 9.85 |
73 | -9.47 | 8.83 | 0.0 | 0.00 | 0.00 | -0.64 |
74 | -4.61 | 5.92 | 0.0 | 0.00 | 8.83 | 10.14 |
recommend_from = user_jokes['total'].idxmax()
recommend_from
86
# checking our work (.ix is deprecated -- use .loc for labels or .iloc for positions)
user_jokes.loc[86, :]
22358 8.010
2255 7.770
3509 0.000
5175 0.000
8767 0.000
total 3.156
Name: 86, dtype: float64
# build a flow for a given user then turn this into a function
def get_neighbors(userA, others):
    knn = NearestNeighbors(n_neighbors=5, metric='cosine', n_jobs=-1)
    knn.fit(others)
    distances, indices = knn.kneighbors(userA.values.reshape(1, -1))
    # kneighbors returns positional indices into `others`; map them back to
    # the original row labels of df so later .loc lookups are correct
    return distances[0], others.index[indices[0]]
def get_recommended_joke(userA, indices):
    # jokes userA has not rated (the joke columns are labeled 1-100, so use
    # the index labels rather than np.where's 0-based positions)
    jokes_not_rated = userA.index[userA == 99]
    user_jokes = df.drop(0, axis=1).loc[indices, jokes_not_rated].T.replace(99, 0)
    user_jokes['total'] = user_jokes.sum(axis=1)
    recommended_joke = user_jokes['total'].idxmax()
    return recommended_joke
def recommend_joke(user_index=0):
    userA = df.drop(0, axis=1).loc[user_index, :]
    try:
        # find the nearest neighbors among all other users
        others = df.drop(0, axis=1).drop(index=user_index)
        distances, indices = get_neighbors(userA, others)
        # pick the unrated joke with the highest total neighbor rating
        recommended_joke = get_recommended_joke(userA, indices)
        return recommended_joke
    except ValueError:
        # idxmax raises on an empty Series when userA has no unrated jokes
        print("user has rated all jokes")
        return None
recommend_joke(1923)
user has rated all jokes
df.iloc[1923, :].replace(99, np.nan).isna().sum()
0
- cosine distance
- the recommendation algorithm doesn't always have to use k-nearest neighbors
- general workflow
- .loc/.iloc as slicers for dataframes (.ix is deprecated)
- .idxmax to get the index of the max value
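On the last point, a quick illustration of `.idxmax` versus `.max`, using the neighbor totals from the table above:

```python
import pandas as pd

totals = pd.Series([-6.21, -13.69, 9.85, -0.64, 10.14],
                   index=[70, 71, 72, 73, 74])
print(totals.max())     # 10.14 -- the largest value
print(totals.idxmax())  # 74    -- the index label where that value lives
```

This is why `user_jokes['total'].idxmax()` returns a joke id rather than a rating.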