
zr-obp's Introduction


[arXiv] [NeurIPS2021 Proceedings]

Open Bandit Pipeline: a research framework for off-policy evaluation and learning

Docs | Google Group | Tutorial | Installation | Usage | Slides | Quickstart | Open Bandit Dataset | 日本語

Table of Contents

Overview

Open Bandit Dataset (OBD)

Open Bandit Dataset is a public real-world logged bandit dataset. This dataset is provided by ZOZO, Inc., the largest fashion e-commerce company in Japan. The company uses multi-armed bandit algorithms to recommend fashion items to users on its large-scale fashion e-commerce platform, ZOZOTOWN. The following figure presents the displayed fashion items as actions; there are three positions in the recommendation interface.

Recommended fashion items as actions in the ZOZOTOWN recommendation interface

The dataset was collected during a 7-day experiment on three “campaigns,” corresponding to all, men's, and women's items, respectively. Each campaign randomly used either the Uniform Random policy or the Bernoulli Thompson Sampling (Bernoulli TS) policy for the data collection. Open Bandit Dataset is unique in that it contains a set of multiple logged bandit datasets collected by running different policies on the same platform. This enables realistic and reproducible experimental comparisons of different OPE estimators for the first time (see Section 5 of the reference paper for the details of the evaluation of OPE protocol using Open Bandit Dataset).

A small-sized version of our data is available at obd. We release the full-sized version at https://research.zozo.com/data.html; please download the full-sized version for research use. Please also see obd/README.md for a detailed dataset description.

Open Bandit Pipeline (OBP)

Open Bandit Pipeline is open-source Python software comprising a series of modules for dataset preprocessing, policy learning methods, and OPE estimators. Our software provides a complete, standardized experimental procedure for OPE research, ensuring that performance comparisons are fair and reproducible. It also enables fast and accurate OPE implementation through a single unified interface, simplifying the practical use of OPE.

Overview of the Open Bandit Pipeline

Open Bandit Pipeline consists of the following main modules (a minimal import sketch follows the list).

  • dataset module: This module provides a data loader for Open Bandit Dataset and a flexible interface for handling logged bandit data. It also provides tools to generate synthetic bandit data and transform multi-class classification data to bandit data.
  • policy module: This module provides interfaces for implementing new online and offline bandit policies. It also implements several standard policy learning methods.
  • simulator module: This module provides functions for conducting offline bandit simulation. This module is necessary only when you use the ReplayMethod to evaluate online bandit policies. Please refer to examples/quickstart/online.ipynb for a quickstart guide of implementing OPE of online bandit algorithms.
  • ope module: This module provides generic abstract interfaces to support custom implementations so that researchers can evaluate their own estimators easily. It also implements several basic and advanced OPE estimators.
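
For orientation, the main classes referenced throughout this README live in these modules. This is a minimal import sketch rather than an exhaustive list; all names except run_bandit_simulation appear in the examples below, and run_bandit_simulation is assumed to be the simulator entry point, so please check the API reference for the exact exports.

# a minimal map of the package layout (see the Usage section for full examples)
from obp.dataset import OpenBanditDataset, SyntheticBanditDataset, MultiClassToBanditReduction  # dataset module
from obp.policy import BernoulliTS, IPWLearner  # policy module
from obp.simulator import run_bandit_simulation  # simulator module; assumed name, only needed for the ReplayMethod
from obp.ope import OffPolicyEvaluation, RegressionModel, InverseProbabilityWeighting  # ope module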

Supported Bandit Algorithms and OPE Estimators

Bandit Algorithms (click to expand)
OPE Estimators (click to expand)

Please refer to Section 2 and the Appendix of the reference paper for the standard formulation of OPE and the definitions of a range of OPE estimators. Note that, in addition to the above algorithms and estimators, Open Bandit Pipeline provides flexible interfaces. Therefore, researchers can easily implement their own algorithms or estimators and evaluate them with our data and pipeline. Moreover, Open Bandit Pipeline provides an interface for handling real-world logged bandit data. Thus, practitioners can combine their own real-world data with Open Bandit Pipeline and easily evaluate bandit algorithms' performance in their settings with OPE.
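
To illustrate the "implement your own estimator" path, here is a minimal sketch of a custom self-normalized IPW-style estimator. It assumes that BaseOffPolicyEstimator's abstract methods are _estimate_round_rewards, estimate_policy_value, and estimate_interval, and that OffPolicyEvaluation passes reward, action, position, pscore, action_dist, and estimated_rewards_by_reg_model as keyword arguments; please check the ope module's source for the exact signatures before relying on it.

# a sketch of a custom OPE estimator; method names and keyword arguments are assumptions
# based on the built-in estimators, not a verified API
from dataclasses import dataclass

import numpy as np

from obp.ope import BaseOffPolicyEstimator


@dataclass
class MySelfNormalizedIPW(BaseOffPolicyEstimator):
    estimator_name: str = "my_snipw"

    def _estimate_round_rewards(self, reward, action, position, pscore, action_dist, **kwargs):
        # importance weight of the logged (action, position) pair under the evaluation policy
        iw = action_dist[np.arange(action.shape[0]), action, position] / pscore
        return reward * iw / iw.mean()  # self-normalized round-wise rewards

    def estimate_policy_value(self, reward, action, pscore, action_dist, position=None, **kwargs):
        if position is None:
            position = np.zeros_like(action)
        return self._estimate_round_rewards(
            reward=reward, action=action, position=position, pscore=pscore, action_dist=action_dist
        ).mean()

    def estimate_interval(self, *args, **kwargs):
        raise NotImplementedError  # bootstrap confidence intervals are omitted in this sketch

An instance of such a class could then be passed to OffPolicyEvaluation via ope_estimators, exactly like the built-in estimators, provided the signatures above match the installed version.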

Installation

You can install OBP using Python's package manager pip.

pip install obp

You can also install OBP from source.

git clone https://github.com/st-tech/zr-obp
cd zr-obp
python setup.py install

Open Bandit Pipeline supports Python 3.7 or newer. See pyproject.toml for other requirements.
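
To verify the installation, you can import the package and print its version (assuming the obp package exposes __version__, as recent releases do).

import obp

print(obp.__version__)  # prints the installed version, e.g. 0.5.x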

Usage

Example with Synthetic Bandit Data

Here is an example of conducting OPE to estimate the performance of IPWLearner as an evaluation policy, using Direct Method (DM), Inverse Probability Weighting (IPW), and Doubly Robust (DR) as OPE estimators.

# implementing OPE of the IPWLearner using synthetic bandit data
from sklearn.linear_model import LogisticRegression
# import open bandit pipeline (obp)
from obp.dataset import SyntheticBanditDataset
from obp.policy import IPWLearner
from obp.ope import (
    OffPolicyEvaluation,
    RegressionModel,
    InverseProbabilityWeighting as IPW,
    DirectMethod as DM,
    DoublyRobust as DR,
)

# (1) Generate Synthetic Bandit Data
dataset = SyntheticBanditDataset(n_actions=10, reward_type="binary")
bandit_feedback_train = dataset.obtain_batch_bandit_feedback(n_rounds=1000)
bandit_feedback_test = dataset.obtain_batch_bandit_feedback(n_rounds=1000)

# (2) Off-Policy Learning
eval_policy = IPWLearner(n_actions=dataset.n_actions, base_classifier=LogisticRegression())
eval_policy.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"]
)
action_dist = eval_policy.predict(context=bandit_feedback_test["context"])

# (3) Off-Policy Evaluation
regression_model = RegressionModel(
    n_actions=dataset.n_actions,
    base_model=LogisticRegression(),
)
estimated_rewards_by_reg_model = regression_model.fit_predict(
    context=bandit_feedback_test["context"],
    action=bandit_feedback_test["action"],
    reward=bandit_feedback_test["reward"],
)
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback_test,
    ope_estimators=[IPW(), DM(), DR()]
)
ope.visualize_off_policy_estimates(
    action_dist=action_dist,
    estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
)

Performance of IPWLearner estimated by OPE

A formal quickstart example with synthetic bandit data is available at examples/quickstart/synthetic.ipynb. We also provide a script for conducting the evaluation-of-OPE experiment with synthetic bandit data in examples/synthetic.
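
Since SyntheticBanditDataset knows the true expected rewards, you can also check how accurate each estimator is, continuing the example above. This sketch assumes bandit_feedback_test contains an "expected_reward" entry and that the dataset exposes calc_ground_truth_policy_value with the arguments shown; see examples/quickstart/synthetic.ipynb for the exact interface.

# (4) Evaluation of OPE (a sketch continuing the example above)
ground_truth_policy_value = dataset.calc_ground_truth_policy_value(
    expected_reward=bandit_feedback_test["expected_reward"],
    action_dist=action_dist,
)
estimation_errors = ope.evaluate_performance_of_estimators(
    ground_truth_policy_value=ground_truth_policy_value,
    action_dist=action_dist,
    estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
    metric="se",  # squared error; "relative-ee" is also available
)
print(estimation_errors)  # e.g., {'ipw': ..., 'dm': ..., 'dr': ...}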

Example with Multi-Class Classification Data

Researchers often use multi-class classification data to evaluate the estimation accuracy of OPE estimators. Open Bandit Pipeline facilitates this kind of OPE experiment with multi-class classification data as follows.

# implementing an experiment to evaluate the accuracy of OPE using classification data
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# import open bandit pipeline (obp)
from obp.dataset import MultiClassToBanditReduction
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# (1) Data Loading and Bandit Reduction
X, y = load_digits(return_X_y=True)
dataset = MultiClassToBanditReduction(X=X, y=y, base_classifier_b=LogisticRegression(random_state=12345))
dataset.split_train_eval(eval_size=0.7, random_state=12345)
bandit_feedback = dataset.obtain_batch_bandit_feedback(random_state=12345)

# (2) Evaluation Policy Derivation
# obtain action choice probabilities of an evaluation policy
action_dist = dataset.obtain_action_dist_by_eval_policy(base_classifier_e=RandomForestClassifier(random_state=12345))
# calculate the ground-truth performance of the evaluation policy
ground_truth = dataset.calc_ground_truth_policy_value(action_dist=action_dist)
print(ground_truth)
0.9634340222575517

# (3) Off-Policy Evaluation and Evaluation of OPE
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
# evaluate the estimation performance (accuracy) of IPW by the relative estimation error (relative-ee)
relative_estimation_errors = ope.evaluate_performance_of_estimators(
        ground_truth_policy_value=ground_truth,
        action_dist=action_dist,
        metric="relative-ee",
)
print(relative_estimation_errors)
{'ipw': 0.01827255896321327} # the accuracy of IPW in OPE

A formal quickstart example with multi-class classification data is available at examples/quickstart/multiclass.ipynb. We also provide a script for conducting the evaluation-of-OPE experiment with multi-class classification data in examples/multiclass.
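
The same classification-based setup can also compare model-dependent estimators. Continuing the example above, the sketch below adds DM and DR with a RegressionModel; it assumes the reduced bandit_feedback dictionary exposes "context", "action", and "reward" keys, as in the synthetic example.

# a sketch extending the example above to DM and DR
from sklearn.linear_model import LogisticRegression
from obp.ope import DirectMethod as DM, DoublyRobust as DR, RegressionModel

regression_model = RegressionModel(
    n_actions=dataset.n_actions,
    base_model=LogisticRegression(random_state=12345),
)
estimated_rewards_by_reg_model = regression_model.fit_predict(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
)
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW(), DM(), DR()])
relative_estimation_errors = ope.evaluate_performance_of_estimators(
    ground_truth_policy_value=ground_truth,
    action_dist=action_dist,
    estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
    metric="relative-ee",
)
print(relative_estimation_errors)  # e.g., {'ipw': ..., 'dm': ..., 'dr': ...}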

Example with Open Bandit Dataset

Here is an example of conducting OPE of the performance of BernoulliTS as an evaluation policy using Inverse Probability Weighting (IPW) and logged bandit data generated by the Random policy (behavior policy) on the ZOZOTOWN platform.

# implementing OPE of the BernoulliTS policy using log data generated by the Random policy
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# (1) Data Loading and Preprocessing
dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# (2) Production Policy Replication
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True, # replicate the policy in the ZOZOTOWN production
    campaign="all",
    random_state=12345
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)

# (3) Off-Policy Evaluation
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)

# estimated performance of BernoulliTS relative to the ground-truth performance of Random
relative_policy_value_of_bernoulli_ts = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
print(relative_policy_value_of_bernoulli_ts)
1.198126...

A formal quickstart example with Open Bandit Dataset is available at examples/quickstart/obd.ipynb. We also provide a script for conducting the evaluation of OPE using Open Bandit Dataset in examples/obd. Please see our documentation for details of the evaluation-of-OPE protocol based on Open Bandit Dataset.
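
For the evaluation-of-OPE protocol, the IPW estimate above is compared with an on-policy estimate computed from the data logged by BernoulliTS itself. The sketch below assumes OpenBanditDataset provides a calc_on_policy_policy_value_estimate class method, as used in the example scripts; please check examples/obd for the exact interface.

# a sketch of the evaluation-of-OPE step (method name assumed from the example scripts)
ground_truth_policy_value = OpenBanditDataset.calc_on_policy_policy_value_estimate(
    behavior_policy="bts", campaign="all"
)
relative_ee = ope.evaluate_performance_of_estimators(
    ground_truth_policy_value=ground_truth_policy_value,
    action_dist=action_dist,
    metric="relative-ee",
)
print(relative_ee)  # e.g., {'ipw': ...}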

Citation

If you use our dataset and pipeline in your work, please cite our paper:

Yuta Saito, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita.
Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation
https://arxiv.org/abs/2008.07146

Bibtex:

@article{saito2020open,
  title={Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation},
  author={Saito, Yuta and Aihara, Shunsuke and Matsutani, Megumi and Narita, Yusuke},
  journal={arXiv preprint arXiv:2008.07146},
  year={2020}
}

The paper has been accepted at NeurIPS2021 Datasets and Benchmarks Track. The camera-ready version of the paper is available here.

Sister Package: pyIEOE

In addition to OBP, we develop a Python package called pyIEOE, which allows practitioners to easily evaluate and compare the robustness of OPE estimators.

Please also see the following reference paper about IEOE (accepted at RecSys'21).

Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, Kei Tateno.
Evaluating the Robustness of Off-Policy Evaluation
https://arxiv.org/abs/2108.13703

Google Group

If you are interested in the Open Bandit Project, please follow its updates via the Google Group: https://groups.google.com/g/open-bandit-project

Contribution

Any contributions to Open Bandit Pipeline are more than welcome! Please refer to CONTRIBUTING.md for general guidelines on how to contribute to the project.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Project Team

Developers

Contact

For any question about the paper, data, and pipeline, feel free to contact: [email protected]

References

Papers (click to expand)
  1. Alina Beygelzimer and John Langford. The Offset Tree for Learning with Partial Labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 129–138, 2009.

  2. Olivier Chapelle and Lihong Li. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems, 2249–2257, 2011.

  3. Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, 297–306, 2011.

  4. Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems, 2217–2225, 2010.

  5. Doina Precup, Richard S. Sutton, and Satinder Singh. Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the 17th International Conference on Machine Learning, 759–766. 2000.

  6. Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly Robust Policy Evaluation and Optimization. Statistical Science, 29:485–511, 2014.

  7. Adith Swaminathan and Thorsten Joachims. The Self-normalized Estimator for Counterfactual Learning. In Advances in Neural Information Processing Systems, 3231–3239, 2015.

  8. Dhruv Kumar Mahajan, Rajeev Rastogi, Charu Tiwari, and Adway Mitra. LogUCB: An Explore-Exploit Algorithm for Comments Recommendation. In Proceedings of the 21st ACM international conference on Information and knowledge management, 6–15. 2012.

  9. Lihong Li, Wei Chu, John Langford, Taesup Moon, and Xuanhui Wang. An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models. In Journal of Machine Learning Research: Workshop and Conference Proceedings, volume 26, 19–36. 2012.

  10. Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. In Proceedings of the 34th International Conference on Machine Learning, 3589–3597. 2017.

  11. Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More Robust Doubly Robust Off-policy Evaluation. In Proceedings of the 35th International Conference on Machine Learning, 1447–1456. 2018.

  12. Nathan Kallus and Masatoshi Uehara. Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning. In Advances in Neural Information Processing Systems. 2019.

  13. Yi Su, Lequn Wang, Michele Santacatterina, and Thorsten Joachims. CAB: Continuous Adaptive Blending Estimator for Policy Evaluation and Learning. In Proceedings of the 36th International Conference on Machine Learning, 6005-6014, 2019.

  14. Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly Robust Off-policy Evaluation with Shrinkage. In Proceedings of the 37th International Conference on Machine Learning, 9167-9176, 2020.

  15. Nathan Kallus and Angela Zhou. Policy Evaluation and Optimization with Continuous Treatments. In International Conference on Artificial Intelligence and Statistics, 1243–1251. PMLR, 2018.

  16. Aman Agarwal, Soumya Basu, Tobias Schnabel, and Thorsten Joachims. Effective Evaluation using Logged Bandit Feedback from Multiple Loggers. In Proceedings of the 23rd ACM SIGKDD international conference on Knowledge discovery and data mining, 687–696, 2017.

  17. Nathan Kallus, Yuta Saito, and Masatoshi Uehara. Optimal Off-Policy Evaluation from Multiple Logging Policies. In Proceedings of the 38th International Conference on Machine Learning, 5247-5256, 2021.

  18. Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S Muthukrishnan, Vishwa Vinay, and Zheng Wen. Offline Evaluation of Ranking Policies with Click Models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1685–1694, 2018.

  19. James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Benjamin Carterette. Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1779–1788, 2020.

  20. Yusuke Narita, Shota Yasui, and Kohei Yata. Debiased Off-Policy Evaluation for Recommendation Systems. In Proceedings of the Fifteenth ACM Conference on Recommender Systems, 372-379, 2021.

  21. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for Machine Learning on Graphs. In Advances in Neural Information Processing Systems. 2020.

  22. Noveen Sachdeva, Yi Su, and Thorsten Joachims. Off-policy Bandits with Deficient Support. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 965-975, 2021.

  23. Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. Adaptive Estimator Selection for Off-Policy Evaluation. In Proceedings of the 38th International Conference on Machine Learning, 9196-9205, 2021.

  24. Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 487-497, 2022.

  25. Yuta Saito and Thorsten Joachims. Off-Policy Evaluation for Large Action Spaces via Embeddings. In Proceedings of the 39th International Conference on Machine Learning, 2022.

Projects (click to expand)

The Open Bandit Project is strongly inspired by Open Graph Benchmark, a collection of benchmark datasets, data loaders, and evaluators for graph machine learning: [github] [project page] [paper].

zr-obp's People

Contributors

aiueola, daturkel, exkazuu, fullflu, ikenyal, kkmogi, kurorororo, minyus, nomuramasahir0, usaito, ziminpark


zr-obp's Issues

context vector dimensions are different for data sampled with random policy and those sampled with Bernoulli TS

Context vector dimensions are different for data sampled using the uniform random policy and data sampled using Bernoulli TS.
The context vector for the data sampled with the Bernoulli TS policy has a dimension of 22,
whereas the context vector for the data sampled with the random policy has a dimension of 20.

I was not able to find any description of what the two missing dimensions in the context vectors sampled with the random policy represent.
What should I do to match their dimensions?

###############################################################################
Code that I ran:
###############################################################################
from obp.dataset import OpenBanditDataset

dataset_random = OpenBanditDataset(behavior_policy="random", campaign="all")
dataset_bts = OpenBanditDataset(behavior_policy="bts",    campaign="all")
bandit_feedback_random = dataset_random.obtain_batch_bandit_feedback()
bandit_feedback_bts = dataset_bts.obtain_batch_bandit_feedback()

bandit_feedback_random['context'].shape
>> (10000, 20)
bandit_feedback_bts['context'].shape
>> (10000, 22)
###############################################################################

Any specific code formatter?

Hi!
I've found some code implementations that could be improved and I'd like to create a PR.
Related to that, is there a specified code formatter (e.g., black)?
If you can tell me, I will create a PR that conforms to it.

RegressionModel should use bandit_feedback_test in the quickstart notebook for the synthetic data

Since regression_model should predict reward values for test data, regression_model.fit_predict should take bandit_feedback_test instead of bandit_feedback_train, which is the training data for the evaluation policy. The current notebook does not raise errors because the sizes of bandit_feedback_train and bandit_feedback_test are the same.

"estimated_rewards_by_reg_model = regression_model.fit_predict(\n",
" context=bandit_feedback_train[\"context\"],\n",
" action=bandit_feedback_train[\"action\"],\n",
" reward=bandit_feedback_train[\"reward\"],\n",
" n_folds=3, # use 3-fold cross-fitting\n",
" random_state=12345,\n",
")"

The example script correctly uses bandit_feedback_test for regression_model.fit_predict.

estimated_rewards_by_reg_model = regression_model.fit_predict(
    context=bandit_feedback_test["context"],
    action=bandit_feedback_test["action"],
    reward=bandit_feedback_test["reward"],
    n_folds=3,  # 3-fold cross-fitting
    random_state=random_state,
)

`obp.dataset.BaseBanditDataset` class does not exist

Reading the (1) Data loading and preprocessing section of ./README.md, the following sentence gives me the impression that obp.dataset.BaseBanditDataset class exists.

Moreover, by following the interface of obp.dataset.BaseBanditDataset class, one can handle future open datasets for bandit algorithms other than our Open Bandit Dataset. dataset module also provide a class to generate synthetic bandit datasets.

In obp/dataset/base.py, however, only the BaseRealBanditDataset and BaseSyntheticBanditDataset classes are defined.

Are you planning to implement obp.dataset.BaseBanditDataset class?
If not, I propose that obp.dataset.BaseBanditDataset be replaced with obp.dataset.BaseRealBanditDataset or obp.dataset.BaseSyntheticBanditDataset in the ./README.md.

Self Normalized Estimator _estimate_round_rewards is wrong?

In SelfNormalizedInverseProbabilityWeighting._estimate_round_rewards, what is returned in the denominator is iw.mean() when in fact it should be iw.sum(). I think this affects the computation of the confidence intervals for this class.

I found this issue when I noticed that the SNIPS estimator had unusually higher variance than the IPW estimator.

This means that _estimate_policy_value in InverseProbabilityWeighting (the base class) may need to be changed as well, since the return for that is .mean(), and there is no such normalizing constant in the definition of the SNIPS estimator.
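
For reference, the self-normalized IPS (SNIPS) estimator is usually written with the sum of the importance weights in the denominator:

$$\hat{V}_{\mathrm{SNIPS}}(\pi_e) = \frac{\sum_{i=1}^{n} w(x_i, a_i)\, r_i}{\sum_{i=1}^{n} w(x_i, a_i)}, \qquad w(x_i, a_i) = \frac{\pi_e(a_i \mid x_i)}{\pi_b(a_i \mid x_i)},$$

which is algebraically identical to dividing the mean of $w \cdot r$ by the mean of $w$, so whether a given implementation is wrong depends on where the extra $1/n$ factors appear.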

"action_context" arrays in bandit feedbacks of different policies have different values

I checked "action_context" arrays in both bandit_feedbacks acquired with random and Bernoulli TS policies and they are different. Why are they different?
Do they use different values to describe a same feature except the last column which are filled with real values?
below is the code that I ran to check:

###############################################################################
from obp.dataset import OpenBanditDataset

dataset_random = OpenBanditDataset(behavior_policy="random", campaign="all")
dataset_bts = OpenBanditDataset(behavior_policy="bts",    campaign="all")
bandit_feedback_random = dataset_random.obtain_batch_bandit_feedback()
bandit_feedback_bts = dataset_bts.obtain_batch_bandit_feedback()

np.sum(np.abs(bandit_feedback_random['action_context'] - bandit_feedback_bts['action_context']))  # they don't give zero
>> 1143.0
##############################################################################

Citation format lacks uniformity

README.md and README_JN.md

  1. author={Saito, Yuta and Shunsuke Aihara and Megumi Matsutani and Yusuke Narita},

docs/index.rst and obp/README.md

  1. author={Saito, Yuta, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita},

arxiv

  1. author={Yuta Saito and Shunsuke Aihara and Megumi Matsutani and Yusuke Narita},

other possible options

  1. author={Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita},
  2. author={Yuta Saito, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita},

Question

Which one would be better? (I’ll fix them along with other typos)

"action_context" arrays in bandit feedbacks & item_context.csv (downloaded from https://research.zozo.com/data.html#OpenBanditDataset)

In my understanding, both "action_context" and item_context.csv represent each action (80 actions in total) as a 4-dimensional vector (80 actions, 4-dimensional vectors to describe each of them => an (80, 4) array).

But it seems that the action_context and item_context.csv have different values. Why is it so?

The first column (real values) of item_context.csv seems to match the last column of action_context

The 2nd to 4th columns of item_context.csv seem to correspond to the 1st to 3rd columns of action_context, with different values assigned to describe the same feature of an item (or action).

By "action_context" I mean bandit_feedback_random['action_context'], bandit_feedback_bts['action_context'] from below

###############################################################################
from obp.dataset import OpenBanditDataset

dataset_random = OpenBanditDataset(behavior_policy="random", campaign="all")
dataset_bts = OpenBanditDataset(behavior_policy="bts",    campaign="all")
bandit_feedback_random = dataset_random.obtain_batch_bandit_feedback()
bandit_feedback_bts = dataset_bts.obtain_batch_bandit_feedback()
 
np.sum(np.abs(bandit_feedback_random['action_context'] - bandit_feedback_bts['action_context'])) # they don't give 0
#   returns : 1143.0
##############################################################################

alpha_ and lambda_ are not necessary for contextual linear bandit algorithms

Currently, contextual linear and logistic bandit algorithms share the same superclass BaseContextualPolicy.
The constructor of BaseContextualPolicy has alpha_ and lambda_ as arguments:

zr-obp/obp/policy/base.py

Lines 93 to 129 in c9ad20c

@dataclass
class BaseContextualPolicy(metaclass=ABCMeta):
    """Base class for contextual bandit policies.

    Parameters
    ----------
    dim: int
        Number of dimensions of context vectors.

    n_actions: int
        Number of actions.

    len_list: int, default=1
        Length of a list of actions recommended in each impression.
        When Open Bandit Dataset is used, 3 should be set.

    batch_size: int, default=1
        Number of samples used in a batch parameter update.

    alpha_: float, default=1.
        Prior parameter for the online logistic regression.

    lambda_: float, default=1.
        Regularization hyperparameter for the online logistic regression.

    random_state: int, default=None
        Controls the random seed in sampling actions.
    """

    dim: int
    n_actions: int
    len_list: int = 1
    batch_size: int = 1
    alpha_: float = 1.0
    lambda_: float = 1.0
    random_state: Optional[int] = None

These arguments are used to initialize self.alpha_list and self.lambda_list, which are used by LogisticEpsilonGreedy, LogisticTS, and LogisticUCB but not used by LinearEpsilonGreedy, LinTS, and LinUCB.
I suggest moving alpha_, lambda_, self.alpha_list, and self.lambda_list to another class, BaseLogisticPolicy for example, and making logistic policies inherit this new class.

Parameter eval_size in MultiClassToBanditReduction.obtain_batch_bandit_feedback exists only in the docstring

Although the docstring of MultiClassToBanditReduction.obtain_batch_bandit_feedback describes a parameter eval_size, it does not exist. I guess that the docstring should be modified.

def obtain_batch_bandit_feedback(
    self,
    random_state: Optional[int] = None,
) -> BanditFeedback:
    """Obtain batch logged bandit feedback, an evaluation policy, and its ground-truth policy value.

    Note
    -------
    Please call `self.split_train_eval()` before calling this method.

    Parameters
    -----------
    eval_size: float or int, default=0.25
        If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
        If int, represents the absolute number of test samples.

We cannot set integers to the hyperparameters of Switch-DR and DRos

When we try to use integer values for the hyperparameters of Switch-DR and DRos (e.g., tau=100), we see a ValueError as follows.

In Switch-DR,

zr-obp/obp/ope/estimators.py

Lines 1101 to 1104 in f03aeb1

if not isinstance(self.tau, float):
    raise ValueError(
        f"switching hyperparameter must be float, but {self.tau} is given"
    )

In DR-os,

zr-obp/obp/ope/estimators.py

Lines 1224 to 1227 in f03aeb1

if not isinstance(self.lambda_, float):
    raise ValueError(
        f"shrinkage hyperparameter must be float, but {self.lambda_} is given"
    )

However, I think we can allow integer values to be used for these hyperparameters.

Improperly escaped docstrings

https://github.com/st-tech/zr-obp/blob/master/obp/dataset/real.py#L231

https://github.com/st-tech/zr-obp/blob/master/obp/dataset/real.py#L320

These docstrings are not properly escaped, which means we get errors when trying to import this package under Python 3.10.

For example, trying to import OBP within a pytest run, I get:

E     File "<my_repo>/lib/python3.10/site-packages/obp/dataset/real.py", line 213
E       """Obtain batch logged bandit data.
E       ^^^
E   SyntaxError: invalid escape sequence '\m'

Current Read the Docs is not the latest version

The current Read the Docs is not the latest version, so it seems that it needs to be modified.
In fact, the following contents are inconsistent.

class BaseBanditDataset(metaclass=ABCMeta):
    """Base Class for Synthetic Bandit Dataset."""

    @abstractmethod
    def obtain_batch_bandit_feedback(self) -> None:
        """Obtain batch logged bandit feedback."""
        raise NotImplementedError

If the webhook setting of the GitHub repository is working, the Read the Docs side should build the docs automatically, so I am concerned that this setting is not correct.

Question: What is the source of the construction of evaluation and behavior policies?

Hi zr-obp team.

In multiclass.py I see that the evaluation and behavior policies are constructed as follows:

# construct an evaluation policy
pi_e = np.zeros((self.n_rounds_ev, self.n_actions))
pi_e[:, :] = (1.0 - alpha_e) / self.n_actions
pi_e[np.arange(self.n_rounds_ev), preds] = (
    alpha_e + (1.0 - alpha_e) / self.n_actions
)

# construct a behavior policy
pi_b = np.zeros((self.n_rounds_ev, self.n_actions))
pi_b[:, :] = (1.0 - self.alpha_b) / self.n_actions
pi_b[np.arange(self.n_rounds_ev), preds] = (
    self.alpha_b + (1.0 - self.alpha_b) / self.n_actions
)

Which paper does this follow from?

Also, have you tried the following from 20-Su+ paper?
[screenshot: the policy construction from the 20-Su+ paper]

Thanks!

Question/Feature: Is it possible to add the 10 UCI datasets as well for evaluating these OPEs?

Hello OBP team.
From a quick tour of this GitHub page, I found that this might be more suitable for my research (testing and adding new OPE estimators) than using the massive Vowpal Wabbit! Kudos for maintaining a very good repo! :)

I see that the multiclass type dataset from sklearn has been added.
Is it possible to include the 10 UCI datasets in the examples (with corresponding working hyperparameters) that are used in many papers that you cite (like Su+ 20, Wang+ 17, Dudik+ 14)?

New simulation functionality

Hi! You presented a paper regarding the simulation of industrial challenges with OBP at RecSys'22. I found it very interesting and want to understand some details. I found the PR with the source code and notebooks with experiments, but did not find documentation describing the idea and details of the new functionality. So, could you help me with some questions:

  1. What kind of reward functions do you have and how are they trained? I found logistic_reward_function, linear_reward_function and others placed here. Unfortunately, I have not figured out what the training data for them is and whether they are retrained every simulation round.
  2. What is the functionality of BanditEnvironmentSimulator?

It would be great if you could share some papers (other than the one from RecSys'22), schemas, demos, or tutorials explaining the details of your simulation framework.

How to use kernelized inverse probability weighting with OBD and OBP

In the appendix of the paper (https://arxiv.org/pdf/2008.07146.pdf) introducing OBD and OBP, it is mentioned in Appendix E.1 that the kernelized inverse probability weighting [Kallus et al., 2018] is implemented on the Open Bandit Pipeline.

How is the algorithm used in OBP with OBD containing discrete actions?
Do you use the real values contained in the first dimension of item_context.csv?

referenced paper:
[Kallus et al., 2018] Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. AISTATS 2018.

Quickstart is not working without preparing obd directory

If we only installed obp with pip, the current Quickstart raises the following error:

>>> from obp.dataset import OpenBanditDataset
>>> from obp.policy import BernoulliTS
>>> from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW
>>> dataset = OpenBanditDataset(behavior_policy='random', campaign='all')

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-4-b5fa4ef965ba> in <module>
----> 1 dataset = OpenBanditDataset(behavior_policy='random', campaign='all')

~~~

FileNotFoundError: [Errno 2] No such file or directory: 'obd/random/all/all.csv'

​I understand that the obd directory should be prepared in advance (e.g., cloning this repository); however, the current Quickstart page does not provide such descriptions, which may confuse the users a little.
Therefore, I think it's a good idea to add such a note to the page.

I'm sorry if I misunderstood.
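
Not an official answer, but one workaround consistent with the loader's constructor is to download the dataset and point the class at it explicitly; data_path is assumed here to be the relevant argument, so please double-check against the OpenBanditDataset docstring.

# a sketch: load a locally downloaded copy instead of the default ./obd directory
from pathlib import Path

from obp.dataset import OpenBanditDataset

dataset = OpenBanditDataset(
    behavior_policy="random",
    campaign="all",
    data_path=Path("./open_bandit_dataset"),  # assumed argument name; see the class docstring
)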

IPWLearner fit method does not allow position=None

In #70, I set position=None in bandit_feedback dict of synthetic data and classification data. However, the fit method of IPWLearner does not allow position=None.

What we have to do is to modify

if self.len_list == 1:
    position = np.zeros_like(action, dtype=int)

to

if position is None or self.len_list == 1:
    position = np.zeros_like(action, dtype=int)

In addition, I realized that

else:
    if not isinstance(position, np.ndarray) or position.ndim != 1:
        raise ValueError(
            f"when len_list > 1, position must be a 1-dimensional ndarray"
        )

is unnecessary because it is tested in

check_bandit_feedback_inputs(

Training on test set?

Hi, on the README you write:

estimated_rewards_by_reg_model = regression_model.fit_predict(
    context=bandit_feedback_test["context"],
    action=bandit_feedback_test["action"],
    reward=bandit_feedback_test["reward"],
)

But this is basically fitting on test rewards. Is this legal?
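
Not an official answer, but the example script referenced in the quickstart-notebook issue above fits the regression model with cross-fitting via n_folds, which avoids predicting a fold's rewards with a model trained on that same fold; with that option, the call would look like this:

# cross-fitted reward regression (n_folds as shown in the example script above)
estimated_rewards_by_reg_model = regression_model.fit_predict(
    context=bandit_feedback_test["context"],
    action=bandit_feedback_test["action"],
    reward=bandit_feedback_test["reward"],
    n_folds=3,  # each fold is predicted by a model trained on the other folds
    random_state=12345,
)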

Loading my own data?

I have my own log data which I would like to run OPE on. Is this supported? I don't see any way of bringing it into an obp dataset.
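
Not an official answer, but based on the dictionaries used throughout the examples above, logged data is passed around as a plain dict of NumPy arrays, so a sketch of loading your own logs might look as follows. The exact set of required keys is an assumption; obp.utils.check_bandit_feedback_inputs (referenced in another issue above) shows what is validated.

# a sketch of wrapping your own logs as bandit feedback (keys assumed from the examples above)
import numpy as np

from obp.ope import InverseProbabilityWeighting as IPW, OffPolicyEvaluation

n_rounds, n_actions, dim_context = 10000, 5, 8
rng = np.random.default_rng(12345)
bandit_feedback = {
    "n_rounds": n_rounds,
    "n_actions": n_actions,
    "context": rng.normal(size=(n_rounds, dim_context)),  # x_i
    "action": rng.integers(n_actions, size=n_rounds),  # a_i chosen by your logging policy
    "reward": rng.integers(2, size=n_rounds),  # observed r_i
    "pscore": np.full(n_rounds, 1.0 / n_actions),  # pi_b(a_i | x_i), e.g. a uniform-random logger
    "position": None,  # no slot structure
}
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
# then call ope.estimate_policy_values(action_dist=...) with your evaluation policy's action_dist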

Question: Is there an example usage of estimator_tuning?

Hi zr-obp creators. I am currently fiddling with the multiclass example. I encountered estimator_tuning that finds the best hyperparameter in terms of the estimated mse. Is there an example that uses this autotuning estimator?
Thanks!

what's the meaning of position in OPE?

position: array-like, shape (n_rounds,), default=None
Positions of each round in the given logged bandit feedback.

I don't quite understand the meaning of 'position' mentioned here; what does "the positions of each round" mean?

propensity score estimate

Hello,

In the input dataset, propensity scores need to be provided.
Do the propensity scores need to be calibrated?

What is the impact of a wrong propensity score (vs. the reward level...)?

Question: LinTS takes hours to run

Hello, first of all, thanks for the paper, it is really helpful. Just wanted to make sure that this is the expected behavior: LinTS takes >3h to run on the random/all big dataset (>1mln items, CPU only). Is this the expected behavior? Thank you!

The type check frustrates the policy value calculation

Thanks for your great work on OBP, I am trying to learn the basics of OBP with the notebook (online.ipynb).

In the section (3) Evaluation of OPE, I tried to use calc_ground_truth_policy_value for calculating the policy value but there is a type check.

if not np.issubdtype(int, action.dtype):

I used the following code to check that the dtype of the actions is int64.

epsilon_greedy.select_action().dtype # dtype('int64')

I commented out the type check and reinstalled the library, and it works temporarily.

Python: 3.9.2
numpy: 1.20.2
OS: Window 10 x64

contents of "item_context.csv" in OBD are different for random policy and Bernoulli TS policy

I downloaded OBD from https://research.zozo.com/data.html#OpenBanditDataset

I checked item_context.csv in both "open_bandit_dataset/bts/all/item_context.csv" and "open_bandit_dataset/random/all/item_context.csv" and found that both of them contain the same values for item_feature_0, but different values for the rest of the item features (item_feature_1, item_feature_2, item_feature_3).

Why are they different when they should be the same, given that they describe the same 80 items (or actions for the bandit)? Do they use different values to describe the same feature?

More precise logics for n_actions in Dataset and Simulator

Possible Issue

In bandit feedback, n_actions is set as int(self.action.max() + 1), which doesn't raise any error in the above code,
assuming that the logs generated by the policy cover all possible actions.

However, to be more precise, I think n_actions should be given explicitly rather than extracted from the log data.
And if that is changed, the above code might raise an error.
If there are 1000 possible actions, only actions 0-998 exist in bandit_feedback, and the policy somehow selected action 999,
this would raise an out-of-index error.

Idea

  1. BanditFeedback data is given n_actions explicitly.
    Rather than:

    @property
    def n_actions(self) -> int:
        """Number of actions."""
        return int(self.action.max() + 1)

  2. Use n_actions directly in convert_to_action_dist
    Rather than:

    action_dist = convert_to_action_dist(
        n_actions=bandit_feedback["action"].max() + 1,
        selected_actions=np.array(selected_actions_list),
    )

Question: Does switch-dr/dr-os beat dr method in mse?

Hello OBP team!
I am experimenting with the ecoli dataset. Following are the MSE results I get:

                                 mean       std
dm                           0.187152  0.031194
ipw                          0.007999  0.011608
snipw                        0.001898  0.002306
dr                           0.003618  0.004782
mrdr                         0.002018  0.002565
sndr                         0.001929  0.002461
switch-dr (tau=4)            0.066174  0.024147
switch-dr (tau=4.05)         0.057792  0.022635
switch-dr (tau=4.1)          0.049112  0.021502
switch-dr (tau=4.15)         0.037653  0.019118
switch-dr (tau=4.2)          0.025681  0.015920
switch-dr (tau=4.25)         0.008118  0.008474
switch-dr (tau=4.3)          0.003588  0.004752
switch-dr (tau=4.7)          0.003618  0.004782
switch-dr (tau=1000)         0.003618  0.004782
switch-dr (tau=4000)         0.003618  0.004782
switch-dr (tau=10000)        0.003618  0.004782
dr-os (lambda=1)             0.171959  0.029915
dr-os (lambda=10)            0.076683  0.021588
dr-os (lambda=100)           0.009266  0.008618
dr-os (lambda=300)           0.005007  0.006018
dr-os (lambda=500)           0.004375  0.005495
dr-os (lambda=1000)          0.003965  0.005122
dr-os (lambda=4000)          0.003699  0.004863
dr-os (lambda=7000)          0.003664  0.004828
dr-os (lambda=10000)         0.003650  0.004814
dr-os (lambda=20000)         0.003634  0.004798

I see a similar trend in the OBP paper (in the Random → Bernoulli TS columns) as well. Example:
[screenshot: the corresponding results table from the OBP paper]

The results in the Bernoulli TS → Random columns are better, though (that is, switch-dr/dr-os beat dr).
Do you think those policies would make switch-dr/dr-os beat dr for multiclass/synthetic datasets?
In this case, the evaluation policy is Random/uniform. That is easy; I just need to set alpha_e=0.
Can we use Bernoulli TS as a behavior policy for multiclass/synthetic datasets? Can you give me a few pointers on how to do this?

Or if you have any pointers to make switch-dr/dr-os beat dr for multiclass datasets like in 20-Su+ or 17-Wang+, I'd be really grateful! :)

Thanks!

Switch DR current estimator has a typo

The Switch-DR equation from 17-Wang-Agarwal-Dudik is:
[screenshot: the Switch-DR equation from the paper]

The current implementation does this:
[screenshot: the current implementation]

This should fix it:

        estimated_rewards = np.average(
            q_hat_at_position,
            weights=pi_e_at_position * switch_indicator_xt_all_actions,
            axis=1,
        )
        estimated_rewards += switch_indicator * iw * reward
        return estimated_rewards

Although I am not sure how to calculate switch_indicator_xt_all_actions since pscore for all actions isn't available. Thanks! :)

typo in quickstart

https://github.com/st-tech/zr-obp/blob/master/examples/quickstart/quickstart.ipynb

Since the difference cannot be seen clearly in a Jupyter notebook, I created this issue instead of a pull request.

(2) Offline Bandit Simulation
We use Bernoulli TS impelemted in => implemented

(3) Off-Policy Evaluation (OPE)
estimatorsand estiamte => estimators and estimate

In addition to the above, I think it is better to unify the following two expressions:

  • (2) Off-Policy Learning
  • (2) Offline Bandit Simulation
