sb-ai-lab / sim4rec

Simulator for training and evaluation of Recommender Systems

Home Page: https://sb-ai-lab.github.io/Sim4Rec/

License: Apache License 2.0

Python 1.32% Jupyter Notebook 98.68%
evaluation-framework recommendation recommender-system rl-training synthetic-data user-modeling

sim4rec's Introduction

Simulator

Simulator is a framework for training and evaluating recommendation algorithms on real or synthetic data. The framework is built on the pyspark library so that it can work with big data. As part of the simulation process, it incorporates data generators, response functions, and other tools that allow flexible use of the simulator.

Installation

pip install sim4rec

If the installation takes too long, try

pip install sim4rec --use-deprecated=legacy-resolver

To install dependencies with poetry, run

pip install --upgrade pip wheel poetry
poetry install

Quickstart

The following example shows how to use the simulator to train a model iteratively by refitting the recommendation algorithm on the newly collected history log

import numpy as np
import pandas as pd

import pyspark.sql.types as st
from pyspark.ml import PipelineModel
from sim4rec.utils import pandas_to_spark
from sim4rec.modules import RealDataGenerator, Simulator
from sim4rec.response import NoiseResponse, BernoulliResponse

from ucb import UCB
from replay.metrics import NDCG

LOG_SCHEMA = st.StructType([
    st.StructField('user_idx', st.LongType(), True),
    st.StructField('item_idx', st.LongType(), True),
    st.StructField('relevance', st.DoubleType(), False),
    st.StructField('response', st.IntegerType(), False)
])

users_df = pd.DataFrame(
    data=np.random.normal(0, 1, size=(100, 15)),
    columns=[f'user_attr_{i}' for i in range(15)]
)
items_df = pd.DataFrame(
    data=np.random.normal(1, 1, size=(30, 10)),
    columns=[f'item_attr_{i}' for i in range(10)]
)
history_df = pandas_to_spark(pd.DataFrame({
    'user_idx' : [1, 10, 10, 50],
    'item_idx' : [4, 25, 26, 25],
    'relevance' : [1.0, 0.0, 1.0, 1.0],
    'response' : [1, 0, 1, 1]
}), schema=LOG_SCHEMA)

users_df['user_idx'] = np.arange(len(users_df))
items_df['item_idx'] = np.arange(len(items_df))

users_df = pandas_to_spark(users_df)
items_df = pandas_to_spark(items_df)

user_gen = RealDataGenerator(label='users_real')
item_gen = RealDataGenerator(label='items_real')
user_gen.fit(users_df)
item_gen.fit(items_df)
_ = user_gen.generate(100)
_ = item_gen.generate(30)

sim = Simulator(
    user_gen=user_gen,
    item_gen=item_gen,
    data_dir='test_simulator',
    user_key_col='user_idx',
    item_key_col='item_idx',
    log_df=history_df
)

noise_resp = NoiseResponse(mu=0.5, sigma=0.2, outputCol='__noise')
br = BernoulliResponse(inputCol='__noise', outputCol='response')
pipeline = PipelineModel(stages=[noise_resp, br])

model = UCB()
model.fit(log=history_df)

ndcg = NDCG()

train_ndcg = []
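for i in range(10):
    # Sample a fraction (0.1) of the users for this iteration
    users = sim.sample_users(0.1).cache()

    # Recommend 5 items per sampled user, skipping items already seen in the log
    recs = model.predict(log=sim.log, k=5, users=users, items=items_df, filter_seen_items=True).cache()

    # Simulate user responses to the recommendations with the response pipeline
    true_resp = sim.sample_responses(
        recs_df=recs,
        user_features=users,
        item_features=items_df,
        action_models=pipeline
    ).select('user_idx', 'item_idx', 'relevance', 'response').cache()

    # Append the simulated responses to the simulator's history log
    sim.update_log(true_resp, iteration=i)

    # Score the recommendations against the positive responses only
    train_ndcg.append(ndcg(recs, true_resp.filter(true_resp['response'] >= 1), 5))

    # Refit the model on the full log, using 'response' as the relevance signal
    model.fit(sim.log.drop('relevance').withColumnRenamed('response', 'relevance'))

    # Release the cached dataframes
    users.unpersist()
    recs.unpersist()
    true_resp.unpersist()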

print(train_ndcg)
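
The response pipeline in the example chains a noise stage with a Bernoulli stage. As a rough, library-independent illustration of that idea, the sketch below draws a noise value per row from a normal distribution and then samples a binary response with that value as the success probability. The function name `simulate_responses` is hypothetical, and clipping the noise to [0, 1] is an assumption made here so the values are valid probabilities; neither is taken from the sim4rec source.

```python
import numpy as np


def simulate_responses(n_rows, mu=0.5, sigma=0.2, seed=0):
    """Conceptual stand-in for a noise stage followed by a Bernoulli stage."""
    rng = np.random.default_rng(seed)
    # Stage 1: one noise value per row, playing the role of the '__noise' column.
    noise = rng.normal(mu, sigma, size=n_rows)
    # Assumption of this sketch: clip to [0, 1] so each value is a valid probability.
    prob = np.clip(noise, 0.0, 1.0)
    # Stage 2: one Bernoulli draw per row, playing the role of the 'response' column.
    response = rng.binomial(1, prob)
    return prob, response


prob, response = simulate_responses(1000)
print(response[:10], response.mean())
```

With mu=0.5 roughly half of the simulated responses are positive, which is why the quickstart log fills up with a mix of 0 and 1 responses.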

Examples

You can find useful examples in the notebooks folder. They demonstrate how to use synthetic data generators and composite generators, evaluate generator scores, iteratively refit a recommendation algorithm, use response functions, and more.

Experiments with different datasets and a tutorial on writing custom response functions can be found in the 'experiments' folder.

Build from sources

poetry build
pip install ./dist/sim4rec-0.0.1-py3-none-any.whl

Compile documentation

cd docs
make clean && make html

Tests

Tests use the pytest python library. To run the tests for all modules, run the following command from the repository root directory

pytest

Licence

Sim4Rec is distributed under the Apache License Version 2.0. However, the SDV package, which Sim4Rec imports for synthetic data generation, is distributed under the Business Source License (BSL) 1.1.

Synthetic tabular data generation is not a purpose of the Sim4Rec framework. Sim4Rec offers an API and wrappers to run simulations with synthetic data, but the method of synthetic data generation is determined by the user. The SDV package is imported for illustration purposes and may be replaced by another synthetic data generation solution.

Thus, the synthetic data generation functionality and quality evaluation based on the SDV library, namely SDVDataGenerator from generator.py and evaluate_synthetic from evaluation.py, should be used for non-production purposes only, in accordance with the SDV License.

sim4rec's People

Contributors

alexxl1, alexxl1986, blinkop, karimdzan, monkey0head, niksukhorukov

Forkers

sandy4321

sim4rec's Issues

Type check fails for complex types

If a complex type (e.g. ArrayType) is present in the schema, the schemas cannot be compared using symmetric_difference, as is done in

def _check_names_and_types(df1_schema, df2_schema):

because the type is not hashable.
This part of the code needs to be reworked so that it compares only names and types (not nullability) and works with complex types.
Proposed: sort the two lists of (name, DataType) pairs, as is done now, and compare the columns one by one. Raise an error on the first difference.

Add a test showing that the functionality works for simple and complex types.
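
One possible shape for the proposed rework is sketched below. It is not the project's implementation: the name `check_names_and_types_sketch` and the `Field` stand-in are hypothetical, and the function only assumes that each schema is an iterable of fields exposing `name` and `dataType` attributes, as pyspark's StructType does when iterated. It sorts both column lists by name and compares them pairwise, so nothing needs to be hashed and nullability is never inspected.

```python
from collections import namedtuple


def check_names_and_types_sketch(df1_schema, df2_schema):
    """Raise ValueError on the first column whose name or type differs."""
    # Keep only (name, dataType) pairs; nullability is deliberately ignored.
    cols1 = sorted(((f.name, f.dataType) for f in df1_schema), key=lambda c: c[0])
    cols2 = sorted(((f.name, f.dataType) for f in df2_schema), key=lambda c: c[0])

    if len(cols1) != len(cols2):
        raise ValueError(
            f'Schemas have different numbers of columns: {len(cols1)} vs {len(cols2)}'
        )

    # Pairwise comparison relies on equality only, so unhashable types are fine.
    for (name1, type1), (name2, type2) in zip(cols1, cols2):
        if name1 != name2:
            raise ValueError(f'Column names differ: {name1!r} vs {name2!r}')
        if type1 != type2:
            raise ValueError(f'Column {name1!r} has types {type1!r} vs {type2!r}')


# Hypothetical stand-in for a schema field; a list plays the role of an
# unhashable complex type in this illustration.
Field = namedtuple('Field', ['name', 'dataType', 'nullable'])

left = [Field('tags', ['array', 'long'], True), Field('score', 'double', False)]
right = [Field('score', 'double', True), Field('tags', ['array', 'long'], False)]
check_names_and_types_sketch(left, right)  # same names and types, no error
```

A test along the lines requested in the issue could call the function once with only simple types and once with a complex type in each schema, then assert that a mismatch in either case raises an error.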

Add support for Python 3.10 and higher

Hello!

Please add support for the latest versions of Python, 3.10 and 3.11, to the sim4rec package. Python 3.10 is the default version in Google Colab, so the package cannot be installed and used there, only locally with lower versions.

Error from Google Colab:
[image: scan from colaboratory]

Thank you!
