
Framework for Interactive Evaluation of Recommender Systems

License: GNU General Public License v3.0

Topics: python, recommender-systems, analysis-framework, javascript, web-application, machine-learning, evaluation-framework, evaluation-metrics

repsys's Introduction

RepSys: Framework for Interactive Evaluation of Recommender Systems

PyPI version

RepSys is a framework for developing and analyzing recommender systems. It allows you to:

  • Add your own dataset and recommendation models
  • Visually evaluate the models on various metrics
  • Quickly create dataset embeddings to explore the data
  • Preview recommendations using a web application
  • Simulate users' behavior as they receive recommendations

web preview

Online Demo

You can now try RepSys online on our demo site with the MovieLens dataset. Also, check out the interactive blog post we made using the RepSys widgets component.

Publication

Our paper "RepSys: Framework for Interactive Evaluation of Recommender Systems" was accepted to the RecSys'22 conference.

Installation

Install the package using pip:

$ pip install repsys-framework

If you want to use PyMDE for data visualization, you need to install RepSys with the pymde extra:

$ pip install repsys-framework[pymde]

Getting Started

If you want to skip this tutorial and just try the framework, you can copy the content of the demo folder located in the repository. As described in the next step, you still have to download the dataset before you begin.

Otherwise, create an empty project folder that will contain the dataset and model implementations:

├── __init__.py
├── dataset.py
├── models.py
├── repsys.ini
└── .gitignore

dataset.py

First, we need to import our dataset. For this tutorial, we will use the MovieLens 20M dataset, which contains 20 million ratings applied to 27,000 movies by 138,000 users. Please download the ml-20m.zip file and unzip the data into the current folder.
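For example, on a Unix-like system (using the GroupLens download link):

$ curl -LO https://files.grouplens.org/datasets/movielens/ml-20m.zip
$ unzip ml-20m.zip

Then add the following content to the dataset.py file: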

import pandas as pd

from repsys import Dataset
import repsys.dtypes as dtypes

class MovieLens(Dataset):
    def name(self):
        return "ml20m"

    def item_cols(self):
        return {
            "movieId": dtypes.ItemID(),
            "title": dtypes.Title(),
            "genres": dtypes.Tag(sep="|"),
            "year": dtypes.Number(data_type=int),
        }

    def interaction_cols(self):
        return {
            "movieId": dtypes.ItemID(),
            "userId": dtypes.UserID(),
            "rating": dtypes.Interaction(),
        }

    def load_items(self):
        df = pd.read_csv("./ml-20m/movies.csv")
        # extract the release year from titles like "Toy Story (1995)"
        df["year"] = df["title"].str.extract(r"\((\d+)\)")
        return df

    def load_interactions(self):
        df = pd.read_csv("./ml-20m/ratings.csv")
        return df

This code defines a new dataset called ml20m and imports both the ratings and the item data. You must always describe your data structure using the predefined data types. Before returning the data, you can also preprocess it, for example by extracting the movie's release year from the title column.

models.py

Now we define our first recommendation model, a simple implementation of user-based KNN (k-nearest neighbors).

import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

from repsys import Model

class KNN(Model):
    def __init__(self):
        self.model = NearestNeighbors(n_neighbors=20, metric="cosine")

    def name(self):
        return "knn"

    def fit(self, training=False):
        X = self.dataset.get_train_data()
        self.model.fit(X)

    def predict(self, X, **kwargs):
        # if the input contains no interactions at all, return random scores
        if X.count_nonzero() == 0:
            return np.random.uniform(size=X.shape)

        distances, indices = self.model.kneighbors(X)

        # discard the first (closest) neighbor
        distances = distances[:, 1:]
        indices = indices[:, 1:]

        # turn cosine distances into similarity weights normalized per user
        distances = 1 - distances
        sums = distances.sum(axis=1)
        distances = distances / sums[:, np.newaxis]

        def f(dist, idx):
            # weighted sum of the neighbors' interaction vectors
            A = self.dataset.get_train_data()[idx]
            D = sp.diags(dist)
            return D.dot(A).sum(axis=0)

        vf = np.vectorize(f, signature="(n),(n)->(m)")

        predictions = vf(distances, indices)
        # exclude items the user has already interacted with
        predictions[X.nonzero()] = 0

        return predictions

You must define the fit method, which either trains your model on the training data or loads a previously trained model from a file. All models are fitted when the web application starts or when the evaluation process begins. If the training flag is not set, always load your model from a checkpoint to speed up the process. For tutorial purposes, this is omitted here.
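A minimal sketch of such a fit method, assuming the trained NearestNeighbors model is persisted with joblib (the checkpoint path and the use of joblib are illustrative choices, not part of the RepSys API):

import os
import joblib

def fit(self, training=False):
    checkpoint = "./checkpoints/knn.joblib"  # hypothetical checkpoint path
    if training:
        X = self.dataset.get_train_data()
        self.model.fit(X)
        os.makedirs(os.path.dirname(checkpoint), exist_ok=True)
        joblib.dump(self.model, checkpoint)
    else:
        # load the previously trained model instead of refitting it
        self.model = joblib.load(checkpoint)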

You must also define the predict method, which receives a sparse matrix of the users' interactions as input. For each user (a row of the matrix) and item (a column of the matrix), the method should return a predicted score indicating how much the user will enjoy the item.

Additionally, you can declare web application parameters that can be set when a recommender is created in the web UI. The chosen values are then accessible through the **kwargs argument of the predict method. For example, you could create a select input with all unique genres and filter out the movies that do not contain the selected genre.
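A minimal sketch of how such a parameter could be consumed inside predict, assuming the selected value arrives as kwargs["genre"]; the _score and _genre_mask helpers are hypothetical placeholders for the score computation and the item-genre lookup:

def predict(self, X, **kwargs):
    # _score is a hypothetical helper returning an (n_users, n_items) array
    predictions = self._score(X)

    genre = kwargs.get("genre")  # value of the hypothetical "genre" select input
    if genre:
        # _genre_mask is a hypothetical helper returning a boolean array over the
        # item columns, True where the item's genres contain the selected genre
        mask = self._genre_mask(genre)
        predictions[:, ~mask] = 0

    return predictions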

repsys.ini

The last file we need to create is the configuration file, which allows you to control the data splitting process, server settings, framework behavior, etc.

[general]
seed=1234

[dataset]
train_split_prop=0.85
test_holdout_prop=0.2
min_user_interacts=5
min_item_interacts=0

[evaluation]
precision_recall_k=20,50
ndcg_k=100
coverage_k=20
diversity_k=20
novelty_k=20
percentage_lt_k=20
coverage_lt_k=20

[visualization]
embed_method=pymde
pymde_neighbors=15
umap_neighbors=15
umap_min_dist=0.1
tsne_perplexity=30

[server]
port=3001

Splitting the Data

Before we train our models, we need to split the data into train, validation, and test sets. Run the following command from the current directory.

$ repsys dataset split

This will hold out 85% of the users as training data; the remaining 15% will be split evenly into validation and test data (7.5% of the users each). For both the validation and the test set, 20% of the interactions will also be held out for evaluation purposes. The split dataset will be stored in the default checkpoints folder.

Training the Models

Now we can move to the training process. To do this, please call the following command.

$ repsys model train

This command calls the fit method of each model with the training flag set to true. You can limit the trained models using the -m flag with the model's name as a parameter.
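For example, to train only the knn model defined above:

$ repsys model train -m knn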

Evaluating the Models

When the data is prepared and the models are trained, we can evaluate their performance on the held-out users' interactions. Run the following command to do so.

$ repsys model eval

Again, you can limit the models using the -m flag. The results will be stored in the checkpoints folder when the evaluation is done.

Evaluating the Dataset

Before starting the web application, the final step is to evaluate the dataset itself. This procedure creates user and item embeddings of the training and validation data, allowing you to explore the latent space. Run the following command from the project directory.

$ repsys dataset eval

You can choose from the following embedding methods:

  1. UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) is a dimensionality reduction technique similar to t-SNE. Use --method umap (this is the default option).
  2. PyMDE (Minimum-Distortion Embedding) is a fast library designed to minimally distort the relationships between pairs of items. Use --method pymde.
  3. A combination of the PCA and t-SNE algorithms (the dimensionality is first reduced to 50 using PCA, then to 2D using t-SNE). Use --method tsne.
  4. Your own implementation of the algorithm. Use --method custom and add the following method to the model class of your choice. In this case, you must also specify the model's name using the -m parameter.
from sklearn.decomposition import NMF

def compute_embeddings(self, X):
    nmf = NMF(n_components=2)
    W = nmf.fit_transform(X)
    H = nmf.components_
    return W, H.T

In this example, non-negative matrix factorization (NMF) is used. You have to return a pair of user and item embeddings, in this order, and the matrices must have the shape (n_users, n_dim) and (n_items, n_dim), respectively. If the reduced dimension is higher than 2, the t-SNE method is applied afterwards.
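For example, to compute the embeddings using the custom method defined on the knn model:

$ repsys dataset eval --method custom -m knn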

Running the Application

Finally, it is time to start the web application to see the results of the evaluations and preview live recommendations of your models.

$ repsys server

The application should be accessible at the default address http://localhost:3001. When you open the link, you will see the main screen, where your recommendations will appear once you finish the setup. The first step is to define how the item data columns should be mapped to the item view components.

app setup

Then we need to switch to the build mode and add two recommenders: one without a filter and a second one that includes only comedy movies.

add recommender

Now we switch back from the build mode and select a user from the validation set (never seen by a model before).

user select

Finally, we see the user's interaction history on the right side and the recommendations made by the model on the left side.

user select

Contributing

To build the package from source, you first need to install Node.js and npm as documented here. Then you can run the following script from the root directory to build the web application and install the package locally.

$ ./scripts/install-locally.sh

Citation

If you employ RepSys in your research work, please do not forget to cite the related paper:

@inproceedings{10.1145/3523227.3551469,
  author = {\v{S}afa\v{r}\'{\i}k, Jan and Van\v{c}ura, Vojt\v{e}ch and Kord\'{\i}k, Pavel},
  title = {RepSys: Framework for Interactive Evaluation of Recommender Systems},
  year = {2022},
  isbn = {9781450392785},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3523227.3551469},
  doi = {10.1145/3523227.3551469},
  booktitle = {Proceedings of the 16th ACM Conference on Recommender Systems},
  pages = {636–639},
  numpages = {4},
  keywords = {User simulation, Distribution analysis, Recommender systems},
  location = {Seattle, WA, USA},
  series = {RecSys '22}
}

The Team

Sponsoring

The development of this framework is sponsored by the Recombee company.

repsys's People

Contributors

cowjen01, kasape, zombak79


repsys's Issues

Docker build: When using COPY with more than one source file, the destination must be a directory and end with a /

For Docker version 24.0.2, build cb74dfcd85, the command docker build -t repsys . crashes with:

DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  56.84MB
Step 1/18 : FROM python:3.7-slim AS compile-image
 ---> 70ec6f685b6f
Step 2/18 : RUN apt-get update
 ---> Using cache
 ---> 4336b4f0e4ec
Step 3/18 : RUN apt-get install -y --no-install-recommends build-essential gcc
 ---> Using cache
 ---> 2ece45428470
Step 4/18 : RUN python -m venv /opt/venv
 ---> Using cache
 ---> 5bd004fb4096
Step 5/18 : ENV PATH="/opt/venv/bin:$PATH"
 ---> Using cache
 ---> e76d80c1d545
Step 6/18 : COPY requirements.txt .
 ---> Using cache
 ---> 385bcbfe53a7
Step 7/18 : RUN pip install --upgrade pip     pip install -r requirements.txt
 ---> Using cache
 ---> c50d35c340de
Step 8/18 : COPY setup.py MANIFEST.in pyproject.toml LICENSE.txt README.md .
When using COPY with more than one source file, the destination must be a directory and end with a /

This can be fixed by changing the line to

COPY setup.py MANIFEST.in pyproject.toml LICENSE.txt README.md ./

Update numpy and scipy to support Python 3.11

For Python 3.11, the command

pip install repsys-framework==0.3.8

ends with:

ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11
ERROR: Could not find a version that satisfies the requirement numpy==1.21.5 (from repsys-framework) (from versions: 1.3.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1, 1.7.2, 1.8.0, 1.8.1, 1.8.2, 1.9.0, 1.9.1, 1.9.2, 1.9.3, 1.10.0.post2, 1.10.1, 1.10.2, 1.10.4, 1.11.0, 1.11.1, 1.11.2, 1.11.3, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 1.13.3, 1.14.0, 1.14.1, 1.14.2, 1.14.3, 1.14.4, 1.14.5, 1.14.6, 1.15.0, 1.15.1, 1.15.2, 1.15.3, 1.15.4, 1.16.0, 1.16.1, 1.16.2, 1.16.3, 1.16.4, 1.16.5, 1.16.6, 1.17.0, 1.17.1, 1.17.2, 1.17.3, 1.17.4, 1.17.5, 1.18.0, 1.18.1, 1.18.2, 1.18.3, 1.18.4, 1.18.5, 1.19.0, 1.19.1, 1.19.2, 1.19.3, 1.19.4, 1.19.5, 1.20.0, 1.20.1, 1.20.2, 1.20.3, 1.21.0, 1.21.1, 1.22.0, 1.22.1, 1.22.2, 1.22.3, 1.22.4, 1.23.0rc1, 1.23.0rc2, 1.23.0rc3, 1.23.0, 1.23.1, 1.23.2, 1.23.3, 1.23.4, 1.23.5, 1.24.0rc1, 1.24.0rc2, 1.24.0, 1.24.1, 1.24.2, 1.24.3, 1.24.4, 1.25.0rc1, 1.25.0, 1.25.1, 1.25.2)
ERROR: No matching distribution found for numpy==1.21.5

Feature proposal: check datatype of values predicted by implemented models

There should be a clear interface that RepSys expects models to follow and that is enforced.
Currently, when a developer implements a model whose predict returns something other than an np.ndarray (such as a sparse csr_matrix or an np.matrix), repsys crashes with shape mismatches in the downstream methods, and it is hard to find out what went wrong.
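A minimal sketch of the kind of check being proposed, assuming it would be applied to the value returned by predict (the function name and messages are illustrative, not part of RepSys):

import numpy as np

def validate_predictions(predictions, n_users, n_items):
    # enforce a dense 2-D numpy array instead of e.g. a sparse matrix or np.matrix
    if not isinstance(predictions, np.ndarray) or isinstance(predictions, np.matrix):
        raise TypeError(
            f"predict() must return a np.ndarray, got {type(predictions).__name__}"
        )
    if predictions.shape != (n_users, n_items):
        raise ValueError(
            f"predict() must return an array of shape {(n_users, n_items)}, "
            f"got {predictions.shape}"
        )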

ImportError: cannot import name 'CLOSED' from 'websockets.connection'

The Docker image was built with a requirements.txt that does not pin the version of the websockets library, so the latest version, 11.0.3, was installed:

Requirement already satisfied: websockets>=10.0 in /opt/venv/lib/python3.7/site-packages (from sanic==21.9.3->repsys-framework==0.3.6) (11.0.3)

but this version no longer contains the CLOSED state for connections. When running the Docker container (i.e., calling repsys server), the following error appeared:

Traceback (most recent call last):
  File "/opt/venv/bin/repsys", line 5, in <module>
    from repsys.__main__ import main
  File "/opt/venv/lib/python3.7/site-packages/repsys/__main__.py", line 4, in <module>
    from repsys.cli import repsys_group
  File "/opt/venv/lib/python3.7/site-packages/repsys/cli.py", line 9, in <module>
    from repsys.core import (
  File "/opt/venv/lib/python3.7/site-packages/repsys/core.py", line 8, in <module>
    from repsys.server import run_server
  File "/opt/venv/lib/python3.7/site-packages/repsys/server.py", line 7, in <module>
    from sanic import Sanic
  File "/opt/venv/lib/python3.7/site-packages/sanic/__init__.py", line 2, in <module>
    from sanic.app import Sanic
  File "/opt/venv/lib/python3.7/site-packages/sanic/app.py", line 77, in <module>
    from sanic.server.protocols.websocket_protocol import WebSocketProtocol
  File "/opt/venv/lib/python3.7/site-packages/sanic/server/protocols/websocket_protocol.py", line 3, in <module>
    from websockets.connection import CLOSED, CLOSING, OPEN
ImportError: cannot import name 'CLOSED' from 'websockets.connection' (/opt/venv/lib/python3.7/site-packages/websockets/connection.py)

Adding websockets==10.0 to requirements fixes the issue:

Requirement already satisfied: websockets>=10.0 in /opt/venv/lib/python3.7/site-packages (from sanic==21.9.3->repsys-framework==0.3.6) (10.0)
