YWBAT
- condition data for a recommender system
- apply cosine similarity to recommend jokes
- describe the pros and cons of using cosine similarity
- Cosine similarity measures the angle between two vectors
- if cosine(v1, v2) == 0 -> perpendicular
- if cosine(v1, v2) == 1 -> same direction
- if cosine(v1, v2) == -1 -> opposite direction
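A quick sketch of these three reference cases using sklearn's `cosine_similarity` (the vectors here are chosen purely for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

v = np.array([[1.0, 0.0]])
perpendicular = np.array([[0.0, 1.0]])   # 90 degrees apart
same_direction = np.array([[3.0, 0.0]])  # same direction, different length
opposite = np.array([[-2.0, 0.0]])       # 180 degrees apart

print(cosine_similarity(v, perpendicular))   # [[0.]]
print(cosine_similarity(v, same_direction))  # [[1.]]
print(cosine_similarity(v, opposite))        # [[-1.]]
```

Note that length doesn't matter: `[3, 0]` still scores 1 against `[1, 0]`, since only the angle is compared.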
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances
import matplotlib.pyplot as plt
import seaborn as sns
Format:
- Data files are in .zip format; when unzipped, they are in Excel (.xls) format
- Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").
- One row per user
- The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.
- The sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense. Almost all users have rated those jokes (see discussion of "universal queries" in the above paper).
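Since "99" encodes "not rated", one way to sanity-check the layout is to treat those entries as missing and recount. A minimal sketch on a toy frame in the same layout (the toy values here are made up, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame in the Jester layout: column 0 = number of jokes rated,
# remaining columns = ratings, 99 = not rated.
toy = pd.DataFrame({0: [2, 1],
                    1: [4.5, 99.0],
                    2: [-3.2, 99.0],
                    3: [99.0, 7.1]})

ratings = toy.drop(0, axis=1).replace(99.0, np.nan)
print(ratings.notna().sum(axis=1))  # should match column 0: 2 and 1
```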
df = pd.read_excel("./data/jester-data-1.xls", header=None)
print(df.shape)
df.head()
(24983, 101)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 74 | -7.82 | 8.79 | -9.66 | -8.16 | -7.52 | -8.50 | -9.85 | 4.17 | -8.98 | ... | 2.82 | 99.00 | 99.00 | 99.00 | 99.00 | 99.00 | -5.63 | 99.00 | 99.00 | 99.00 |
1 | 100 | 4.08 | -0.29 | 6.36 | 4.37 | -2.38 | -9.66 | -0.73 | -5.34 | 8.88 | ... | 2.82 | -4.95 | -0.29 | 7.86 | -0.19 | -2.14 | 3.06 | 0.34 | -4.32 | 1.07 |
2 | 49 | 99.00 | 99.00 | 99.00 | 99.00 | 9.03 | 9.27 | 9.03 | 9.27 | 99.00 | ... | 99.00 | 99.00 | 99.00 | 9.08 | 99.00 | 99.00 | 99.00 | 99.00 | 99.00 | 99.00 |
3 | 48 | 99.00 | 8.35 | 99.00 | 99.00 | 1.80 | 8.16 | -2.82 | 6.21 | 99.00 | ... | 99.00 | 99.00 | 99.00 | 0.53 | 99.00 | 99.00 | 99.00 | 99.00 | 99.00 | 99.00 |
4 | 91 | 8.50 | 4.61 | -4.17 | -5.39 | 1.36 | 1.60 | 7.04 | 4.61 | -0.44 | ... | 5.19 | 5.58 | 4.27 | 5.19 | 5.73 | 1.55 | 3.11 | 6.55 | 1.80 | 1.60 |
5 rows × 101 columns
v1 = np.array([1, 2])
v2 = np.array([1, 2.5])
cosine_similarity(v1.reshape(1, -1), v2.reshape(1, -1)), cosine_distances(v1.reshape(1, -1), v2.reshape(1, -1))
(array([[0.99654576]]), array([[0.00345424]]))
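As the two numbers above suggest, sklearn defines cosine distance as 1 minus cosine similarity, so the pair always sums to 1. A quick check:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

v1 = np.array([[1.0, 2.0]])
v2 = np.array([[1.0, 2.5]])

sim = cosine_similarity(v1, v2)[0, 0]
dist = cosine_distances(v1, v2)[0, 0]
print(sim + dist)  # ~1.0, since distance = 1 - similarity
```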
- How do we recommend a joke to userA?
- user to user ->
- find users that are similar to userA
- from those similar users, recommend jokes they rated highly that userA has not yet rated
# option 1: replace the 99s with 0s -- but 0 is a valid rating on the scale...
# option 2: add 11 to every rating so the scale runs 1 to 21 with no negatives --
#           but shifting every vector changes the angles between them, which
#           distorts cosine similarity...
# so for now, let's leave the ratings as they are
# build a flow for a given user then turn this into a function
user_index = 0
userA = df.drop(0, axis=1).loc[user_index, :]
# let's get the other users
others = df.drop(0, axis=1).drop(index=user_index, axis=0)
# let's find the nearest neighbors
knn = NearestNeighbors(n_neighbors=5, metric='cosine', n_jobs=-1)
knn.fit(others)
NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
metric_params=None, n_jobs=-1, n_neighbors=5, p=2, radius=1.0)
distances, indices = knn.kneighbors(userA.values.reshape(1, -1))
distances, indices = distances[0], indices[0]
distances, indices
(array([0.02494242, 0.03028924, 0.0435472 , 0.04501014, 0.04511571]),
array([22358, 2255, 3509, 5175, 8767]))
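With `metric='cosine'`, the distances returned by `kneighbors` are cosine distances (1 minus cosine similarity), so smaller means more similar. A toy sketch of that behavior (data made up for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1.0, 0.0],   # row 0: same direction as the query
              [0.0, 1.0],   # row 1: perpendicular to the query
              [1.0, 1.0]])  # row 2: 45 degrees from the query
knn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(X)
distances, indices = knn.kneighbors(np.array([[2.0, 0.0]]))
print(indices[0])    # nearest first: row 0, then row 2, then row 1
print(distances[0])  # roughly 0.0, 0.29, 1.0
```

One caveat worth remembering: the indices are positions in the fitted array, not labels from a dataframe's index.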
# let's get jokes not rated by userA
jokes_not_rated = np.where(userA==99)[0]
jokes_not_rated
array([70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88,
89, 91, 92, 93, 94, 95, 97, 98, 99])
user_jokes = df.drop(0, axis=1).loc[indices, jokes_not_rated].T.replace(99, 0)
user_jokes['total'] = user_jokes.sum(axis=1)  # total neighbor rating per joke
user_jokes.head()
22358 | 2255 | 3509 | 5175 | 8767 | total | |
---|---|---|---|---|---|---|
70 | -3.88 | -0.53 | 0.0 | -5.97 | 4.17 | -6.21 |
71 | -9.22 | -4.47 | 0.0 | 0.00 | 0.00 | -13.69 |
72 | -1.17 | 7.82 | 0.0 | 3.20 | 0.00 | 9.85 |
73 | -9.47 | 8.83 | 0.0 | 0.00 | 0.00 | -0.64 |
74 | -4.61 | 5.92 | 0.0 | 0.00 | 8.83 | 10.14 |
recommend_from = user_jokes['total'].idxmax()
recommend_from
86
# checking our work (.ix is deprecated -- use .loc for labels or .iloc for positions)
user_jokes.loc[86, :]
22358 8.010
2255 7.770
3509 0.000
5175 0.000
8767 0.000
total 3.156
Name: 86, dtype: float64
# build a flow for a given user then turn this into a function
def get_neighbors(userA, others):
    knn = NearestNeighbors(n_neighbors=5, metric='cosine', n_jobs=-1)
    knn.fit(others)
    distances, indices = knn.kneighbors(userA.values.reshape(1, -1))
    # kneighbors returns positional indices into `others`; map them back to
    # the original row labels of df so later .loc lookups are correct
    return distances[0], others.index[indices[0]]
def get_recommended_joke(userA, indices):
    # jokes userA has not rated (the joke columns are labeled 1-100, so use
    # the index labels rather than np.where's 0-based positions)
    jokes_not_rated = userA.index[userA == 99]
    user_jokes = df.drop(0, axis=1).loc[indices, jokes_not_rated].T.replace(99, 0)
    user_jokes['total'] = user_jokes.sum(axis=1)
    recommended_joke = user_jokes['total'].idxmax()
    return recommended_joke
def recommend_joke(user_index=0):
    userA = df.drop(0, axis=1).loc[user_index, :]
    try:
        # find the nearest neighbors among all other users
        others = df.drop(0, axis=1).drop(index=user_index)
        distances, indices = get_neighbors(userA, others)
        # pick the unrated joke with the highest total neighbor rating
        recommended_joke = get_recommended_joke(userA, indices)
        return recommended_joke
    except ValueError:
        # idxmax raises on an empty Series when userA has no unrated jokes
        print("user has rated all jokes")
        return None
recommend_joke(1923)
user has rated all jokes
df.iloc[1923, :].replace(99, np.nan).isna().sum()
0
- cosine distance
- the recommendation algorithm doesn't always have to use k-nearest neighbors
- general workflow
- .loc/.iloc as slicers for dataframes (.ix is deprecated)
- .idxmax to get the index of the max value
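On the last point, a quick illustration of `.idxmax` versus `.max`, using the neighbor totals from the table above:

```python
import pandas as pd

totals = pd.Series([-6.21, -13.69, 9.85, -0.64, 10.14],
                   index=[70, 71, 72, 73, 74])
print(totals.max())     # 10.14 -- the largest value
print(totals.idxmax())  # 74    -- the index label where that value lives
```

This is why `user_jokes['total'].idxmax()` returns a joke id rather than a rating.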