cv2-mod5-sec41-recommender-systems-lesson's Introduction

Questions

Objectives

YWBAT (you will be able to)

  • condition data for a recommender system
  • apply cosine similarity to recommend jokes
  • describe the pros and cons of using cosine similarity

What does cosine similarity measure?

  • The cosine of the angle between two vectors, which depends on their direction but not on their length
    • if cosine(v1, v2) == 0 -> perpendicular
    • if cosine(v1, v2) == 1 -> same direction
    • if cosine(v1, v2) == -1 -> opposite direction
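
The three cases above can be checked by hand. This is a minimal sketch using plain NumPy; the helper name `cosine` and the example vectors are illustrative assumptions, not part of the lesson's code.

```python
import numpy as np

def cosine(v1, v2):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Hypothetical vectors chosen to hit each of the three cases.
perpendicular = cosine(np.array([1, 0]), np.array([0, 1]))
same_direction = cosine(np.array([1, 2]), np.array([2, 4]))
opposite = cosine(np.array([1, 2]), np.array([-1, -2]))

print(perpendicular, same_direction, opposite)
```

Note that `[1, 2]` and `[2, 4]` have different lengths but still count as pointing in the same direction.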

Outline

import pandas as pd
import numpy as np

from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances


import matplotlib.pyplot as plt
import seaborn as sns

About the data

Format:

  • Data files are distributed as .zip archives; when unzipped, they are Excel (.xls) files
  • Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").
  • One row per user
  • The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.
  • The sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense. Almost all users have rated those jokes (see discussion of "universal queries" in the above paper).
df = pd.read_excel("./data/jester-data-1.xls", header=None)
print(df.shape)
df.head()
(24983, 101)
        0      1      2      3      4      5      6      7      8      9  ...     91     92     93     94     95     96     97     98     99    100
0      74  -7.82   8.79  -9.66  -8.16  -7.52  -8.50  -9.85   4.17  -8.98  ...   2.82  99.00  99.00  99.00  99.00  99.00  -5.63  99.00  99.00  99.00
1     100   4.08  -0.29   6.36   4.37  -2.38  -9.66  -0.73  -5.34   8.88  ...   2.82  -4.95  -0.29   7.86  -0.19  -2.14   3.06   0.34  -4.32   1.07
2      49  99.00  99.00  99.00  99.00   9.03   9.27   9.03   9.27  99.00  ...  99.00  99.00  99.00   9.08  99.00  99.00  99.00  99.00  99.00  99.00
3      48  99.00   8.35  99.00  99.00   1.80   8.16  -2.82   6.21  99.00  ...  99.00  99.00  99.00   0.53  99.00  99.00  99.00  99.00  99.00  99.00
4      91   8.50   4.61  -4.17  -5.39   1.36   1.60   7.04   4.61  -0.44  ...   5.19   5.58   4.27   5.19   5.73   1.55   3.11   6.55   1.80   1.60

5 rows × 101 columns
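
Since 99 stands for "not rated", one way to sanity-check the format described above is to hide the sentinel and count the remaining ratings per user. This is a rough sketch on a small made-up frame (`toy` is hypothetical, not the Jester file), assuming the same layout: column 0 holds the count and the other columns hold ratings.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ratings file: column 0 is the rated-joke count,
# the remaining columns are ratings, with 99 meaning "not rated".
toy = pd.DataFrame([
    [2, 4.5, 99.0, -3.0],
    [3, 1.0, 2.0, 3.0],
    [1, 99.0, 99.0, 7.5],
])

ratings = toy.drop(0, axis=1).replace(99, np.nan)  # hide the sentinel
rated_counts = ratings.notna().sum(axis=1)          # jokes rated per user

print(rated_counts.tolist())
print((rated_counts == toy[0]).all())  # do the counts agree with column 0?
```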

v1 = np.array([1, 2])
v2 = np.array([1, 2.5])
cosine_similarity(v1.reshape(1, -1), v2.reshape(1, -1)), cosine_distances(v1.reshape(1, -1), v2.reshape(1, -1))
(array([[0.99654576]]), array([[0.00345424]]))
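
One property worth keeping in mind when weighing pros and cons: cosine similarity ignores magnitude, and cosine distance is simply one minus the similarity. The sketch below uses two hypothetical raters, `mild` and `intense`, who rank jokes identically but with different strength.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

# Hypothetical raters: same tastes, but the second rates 5x as strongly.
mild = np.array([[1, 2, -1]])
intense = np.array([[5, 10, -5]])

sim = cosine_similarity(mild, intense)[0, 0]
dist = cosine_distances(mild, intense)[0, 0]

print(sim, dist)
```

Whether treating these two users as identical is a pro or a con depends on whether rating intensity matters for the recommendation.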

How do we build a recommender system?

  • How do we recommend a joke to userA?
    • user to user ->
      • find users that are similar to userA
      • recommend to userA the jokes those users rated highly that userA has not rated yet
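
The user-to-user idea above can be sketched end to end on a tiny example. Everything here is an illustrative assumption: the ratings matrix is made up, 0 is used to mean "not rated", and only the single nearest user is consulted.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are jokes, 0 means "not rated".
ratings = np.array([
    [5.0, 3.0, 0.0, 0.0],     # userA
    [4.0, 3.0, 4.0, -2.0],    # similar taste to userA
    [-5.0, -3.0, -4.0, 5.0],  # opposite taste
])

def cosine(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

user_a = ratings[0]
others = ratings[1:]

# Step 1: find the most similar other user.
sims = np.array([cosine(user_a, other) for other in others])
neighbor = others[np.argmax(sims)]

# Step 2: among jokes userA has not rated, pick the neighbor's favorite.
unrated = np.where(user_a == 0)[0]
recommended = unrated[np.argmax(neighbor[unrated])]

print(recommended)
```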

Let's condition the data for a recommender system

# we need to do something about the 99s ("not rated"), e.g. replace them with 0s
# but 0 is a valid rating on the scale...
# shifting every rating up by 11 would remove the negatives (new scale: 1 to 21) and free up 0
# never mind, adding 11 is a terrible idea...

# let's just not do anything for now and leave the 99s in place...
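
Leaving the 99s in place has a visible effect on cosine similarity, because 99 is far larger than any real rating. The sketch below compares two hypothetical users with the sentinel kept and with it replaced by 0; zero-filling is only one possible choice, and as noted above it collides with a valid rating.

```python
import numpy as np

def cosine(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Two hypothetical users who disagree on every joke they both rated
# and who both skipped the last two jokes (99 = not rated).
user_x = np.array([5.0, -5.0, 99.0, 99.0])
user_y = np.array([-5.0, 5.0, 99.0, 99.0])

raw = cosine(user_x, user_y)  # the 99s dominate the dot product
zero_filled = cosine(np.where(user_x == 99, 0, user_x),
                     np.where(user_y == 99, 0, user_y))

print(raw, zero_filled)
```

With the sentinel kept, the two users look almost identical simply because they skipped the same jokes; with it zeroed out, their opposite tastes show up.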
# build a flow for a given user then turn this into a function

user_index = 0
userA = df.drop(0, axis=1).loc[user_index, :]

# let's get the other users
others = df.drop(0, axis=1).drop(index=user_index, axis=0)


# let's find the nearest neighbors
knn = NearestNeighbors(n_neighbors=5, metric='cosine', n_jobs=-1)
knn.fit(others)
NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=-1, n_neighbors=5, p=2, radius=1.0)
distances, indices = knn.kneighbors(userA.values.reshape(1, -1))
distances, indices = distances[0], indices[0]
distances, indices
(array([0.02494242, 0.03028924, 0.0435472 , 0.04501014, 0.04511571]),
 array([22358,  2255,  3509,  5175,  8767]))
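
A detail to watch here: `kneighbors` returns row positions within the array passed to `fit`, not DataFrame labels. Because `others` has a row dropped, positions and labels can differ. A small sketch on a made-up frame (`toy` and its values are hypothetical) shows one way to map positions back to labels.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical 4-user frame with row labels 0..3.
toy = pd.DataFrame(
    [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.9, 0.0]],
    index=[0, 1, 2, 3],
)
user_index = 0
others = toy.drop(index=user_index)  # labels 1, 2, 3 remain

knn = NearestNeighbors(n_neighbors=2, metric='cosine')
knn.fit(others.values)
_, positions = knn.kneighbors(toy.loc[user_index].values.reshape(1, -1))

labels = others.index[positions[0]]  # map positions back to row labels

print(positions[0], list(labels))
```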

Now that we have our most similar users, what's next?

Find their highest rated items that aren't rated by userA

# let's get jokes not rated by userA (np.where returns 0-based positions, not joke labels)
jokes_not_rated = np.where(userA==99)[0]
jokes_not_rated
array([70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88,
       89, 91, 92, 93, 94, 95, 97, 98, 99])
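
Because the joke columns are labeled 1 to 100 once column 0 is dropped, the positions returned by `np.where` are one less than the joke labels. The sketch below, on a hypothetical four-joke user row, contrasts the two ways of collecting unrated jokes; positions pair with `.iloc`, labels pair with `.loc`.

```python
import numpy as np
import pandas as pd

# Hypothetical user row whose joke labels start at 1, as they do after df.drop(0, axis=1).
user = pd.Series([4.0, 99.0, -2.0, 99.0], index=[1, 2, 3, 4])

positions = np.where(user == 99)[0]  # 0-based positions
labels = user[user == 99].index      # joke labels

print(positions.tolist(), labels.tolist())
```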
user_jokes = df.drop(0, axis=1).loc[indices, jokes_not_rated].T.replace(99, 0)
user_jokes['total'] = user_jokes.T.sum()
user_jokes.head()
    22358   2255  3509   5175  8767   total
70  -3.88  -0.53   0.0  -5.97  4.17   -6.21
71  -9.22  -4.47   0.0   0.00  0.00  -13.69
72  -1.17   7.82   0.0   3.20  0.00    9.85
73  -9.47   8.83   0.0   0.00  0.00   -0.64
74  -4.61   5.92   0.0   0.00  8.83   10.14
recommend_from = user_jokes['total'].idxmax()
recommend_from
86
# checking our work
user_jokes.loc[86, :]  # .ix is deprecated; use .loc for labels or .iloc for positions
22358    8.010
2255     7.770
3509     0.000
5175     0.000
8767     0.000
total    3.156
Name: 86, dtype: float64

Now let's merge and make a workflow

# build a flow for a given user then turn this into a function
def get_neighbors(userA, others, n_neighbors=5):
    """Return cosine distances and df row labels of the users nearest to userA."""
    knn = NearestNeighbors(n_neighbors=n_neighbors, metric='cosine', n_jobs=-1)
    knn.fit(others)
    distances, positions = knn.kneighbors(userA.values.reshape(1, -1))
    # kneighbors returns row positions within `others`; map them back to df labels
    return distances[0], others.index[positions[0]]


def get_recommended_joke(userA, indices):
    # let's get jokes not rated by userA (column labels, not 0-based positions)
    jokes_not_rated = userA[userA == 99].index
    if len(jokes_not_rated) == 0:
        return None

    # neighbors' ratings for those jokes; an unrated joke (99) counts as 0
    user_jokes = df.loc[indices, jokes_not_rated].T.replace(99, 0)
    user_jokes['total'] = user_jokes.sum(axis=1)
    return user_jokes['total'].idxmax()


def recommend_joke(user_index=0):
    ratings = df.drop(0, axis=1)
    userA = ratings.loc[user_index, :]

    # nearest neighbors among all other users
    others = ratings.drop(index=user_index)
    distances, indices = get_neighbors(userA, others)

    # check explicitly instead of hiding every error behind a bare except
    recommended_joke = get_recommended_joke(userA, indices)
    if recommended_joke is None:
        print("user has rated all jokes")
    return recommended_joke
recommend_joke(1923)
user has rated all jokes
df.iloc[1923, :].replace(99, np.nan).isna().sum()
0

Assessment

  • cosine distance
  • the recommendation algorithm doesn't always have to use k-nearest neighbors
  • general workflow
  • .loc and .iloc as slicers for dataframes (.ix is deprecated)
  • .idxmax to get the index of max value
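
As one illustration of the point that k-nearest neighbors is not required, similar users can also be found by computing cosine similarities directly and sorting them. This is a sketch on a made-up ratings matrix, not a replacement for the workflow above; `ratings`, `user_pos`, and `top_2` are hypothetical names.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows are users, columns are jokes.
ratings = np.array([
    [5.0, 3.0, -4.0],
    [4.0, 2.0, -5.0],
    [-5.0, -3.0, 4.0],
    [5.0, 3.0, -3.0],
])
user_pos = 0

# Similarity of the chosen user to every user, including themselves.
sims = cosine_similarity(ratings[[user_pos]], ratings)[0]
sims[user_pos] = -np.inf            # never pick the user themselves
top_2 = np.argsort(sims)[::-1][:2]  # most similar first

print(top_2)
```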
