
Contextual Bandits in R - simulation and evaluation of Multi-Armed Bandit Policies

Home Page: https://nth-iteration-labs.github.io/contextual/

Topics: contextual, bandit, simulation, statistics, multi-armed, cmab, contextual-bandits, bandit-learning, bandit-experiments, reinforcement-learning

contextual's Introduction

Contextual: Multi-Armed Bandits in R

[Badges: lifecycle, AppVeyor build status, build status, codecov, License: GPL v3, DOI, CRAN status]

Overview

R package facilitating the simulation and evaluation of context-free and contextual Multi-Armed Bandit policies.

The package has been developed to:

  • Ease the implementation, evaluation and dissemination of both existing and new contextual Multi-Armed Bandit policies.
  • Introduce a wider audience to the advanced sequential decision strategies offered by contextual bandit policies.

Package links:

Installation

To install contextual from CRAN:

install.packages('contextual')

To install the development version (requires the devtools package):

install.packages("devtools")
devtools::install_github('Nth-iteration-labs/contextual')

When working on or extending the package, clone its GitHub repository, then do:

install.packages("devtools")
devtools::install_deps(dependencies = TRUE)
devtools::build()
devtools::reload()

followed by a clean and rebuild where needed.

Overview of core classes

Contextual consists of six core classes. Of these, the Bandit and Policy classes are subclassed and extended when implementing custom (synthetic or offline) bandits and policies. The other four classes (Agent, Simulator, History, and Plot) are the workhorses of the package, and generally need not be adapted or subclassed.
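As an illustrative sketch of how these classes fit together (mirroring the basic usage pattern that also appears in the examples and issues further down; parameter values are examples only):

library(contextual)

# A Policy and a Bandit are bound together by an Agent; a Simulator runs
# the agent over a horizon and logs the results in a History object,
# which can then be plotted or inspected as a data table.
policy  <- EpsilonGreedyPolicy$new(epsilon = 0.1)
bandit  <- BasicBernoulliBandit$new(weights = c(0.9, 0.1, 0.1))
agent   <- Agent$new(policy, bandit)

history <- Simulator$new(agent, horizon = 100L, simulations = 100L)$run()

plot(history, type = "cumulative")
head(history$get_data_table())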

Documentation

See the demo directory for practical examples and replications of both synthetic and offline (contextual) bandit policy evaluations.

When seeking to extend contextual, it may also be useful to review "Extending Contextual: Frequently Asked Questions" before diving into the source code.

Additional documentation and examples include:

  • How to replicate figures from two introductory context-free Multi-Armed Bandit texts.
  • Basic, context-free multi-armed bandit examples.
  • Examples of both synthetic and offline contextual multi-armed bandit evaluations.
  • An example of how to make use of the optional theta log to create interactive context-free bandit animations.
  • Some more extensive vignettes to get you started with the package.
  • A paper offering a general overview of the package's structure and API.

Policies and Bandits

Overview of contextual's growing library of contextual and context-free bandit policies:

  • General: Random, Oracle, Fixed
  • Context-free: Epsilon-Greedy, Epsilon-First, UCB1, UCB2, Thompson Sampling, BootstrapTS, Softmax, Gradient, Gittins
  • Contextual: CMAB Naive Epsilon-Greedy, Epoch-Greedy, LinUCB (General, Disjoint, Hybrid), Linear Thompson Sampling, ProbitTS, LogitBTS, GLMUCB
  • Other: Lock-in Feedback (LiF)

Overview of contextual's bandit library:

  • Basic Synthetic: Basic Bernoulli Bandit, Basic Gaussian Bandit
  • Contextual Synthetic: Contextual Bernoulli, Contextual Logit, Contextual Hybrid, Contextual Linear, Contextual Wheel
  • Offline: Replay Evaluator, Bootstrap Replay, Propensity Weighting, Direct Method, Doubly Robust
  • Continuous: Continuum

Alternative parallel backends

By default, "contextual" uses R's built-in parallel package to facilitate parallel evaluation of multiple agents over repeated simulation. See the demo/alternative_parallel_backends directory for several alternative parallel backends:

Maintainers

  • Robin van Emden: author, maintainer*
  • Maurits Kaptein: supervisor*

* Tilburg University / Jheronimus Academy of Data Science.

If you encounter a clear bug, please file a minimal reproducible example on GitHub.

contextual's People

Contributors

g0ulash, robinvanemden


contextual's Issues

How to change the discount factor?

Hi,

This may be a stupid question, but I don't see how to choose the discount factor of the bandits. What is its default value?

Thanks

Minor update required for EpsilonFirstPolicy object input in documentation article

Hi there, I am really excited to use your package to learn more about contextual bandit simulations.

I was trying out a particular documentation article (https://nth-iteration-labs.github.io/contextual/articles/introduction.html) and I think there might be a syntax issue with the parameter input for the EpsilonFirstPolicy object.

# Initialize an EpsilonFirstPolicy with a 100 step exploration period.
ef_policy <- EpsilonFirstPolicy$new(first = 100)

I believe it should be time_steps = 100 instead of first = 100 based on the function help documentation as shown:

#Usage
policy <- EpsilonFirstPolicy(epsilon = 0.1, N = 1000, time_steps = NULL)

#Arguments
epsilon
numeric; value in the closed interval (0,1] that sets the number of time steps to explore through epsilon * N.

N
integer; positive integer which sets the number of time steps to explore through epsilon * N.

time_steps
integer; positive integer which sets the number of time steps to explore - can be used instead of epsilon and N.
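If that reading is correct, the corrected call would presumably be:

# Hedged correction, based on the help page quoted above:
ef_policy <- EpsilonFirstPolicy$new(time_steps = 100)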


Thanks for putting up such a complete package!

Help with creating a custom bandit. Error message: cannot add bindings to a locked environment

Hi, I'd like to create a custom Bernoulli bandit where each context/state has a different probability of occurring. For instance, I want context 1 to appear 70% of the time, and context 2 only 30% of the time. Here is what I have tried:

I copy-pasted the code of ContextualBernoulliBandit, added a prob argument to the initialize function, added a line with self$p <- length(self$prob), and adapted the line where the active feature is randomly chosen, like so:

Xa <- sample(c(1,rep(0,self$d-1)), prob = self$p)

Below is the full code of the class:

ContextualBernoulliBandit2 <- R6::R6Class(
    inherit = ContextualBernoulliBandit,
    class = FALSE,
    public = list(
        weights = NULL,
        class_name = "ContextualBernoulliBandit2",
        initialize = function(weights, prob) {
            self$weights     <- weights
            self$prob        <- prob
            if (is.vector(weights)) {
                self$weights <- matrix(weights, nrow = 1L)
            } else {
                self$weights <- weights               # d x k weight matrix
            }
            self$d           <- nrow(self$weights)  # d features
            self$k           <- ncol(self$weights)  # k arms
            self$p           <- length(self$prob)   # <-- added
        },
        get_context = function(t) {
            # generate d dimensional feature vector, one random feature active at a time
            Xa <- sample(c(1,rep(0,self$d-1)), prob = self$p)
            context <- list(
                X = Xa,
                k = self$k,
                d = self$d,
                p = self$p              # <-- added
            )
        },
        get_reward = function(t, context, action) {
            # which arm was selected?
            arm            <- action$choice
            # d dimensional feature vector for chosen arm
            Xa             <- context$X
            # weights of active context
            weight         <- Xa %*% self$weights
            # assign rewards for active context with weighted probs
            rewards        <- as.double(weight > runif(self$k))
            optimal_arm    <- which_max_tied(weight)
            reward  <- list(
                reward                   = rewards[arm],
                optimal_arm              = optimal_arm,
                optimal_reward           = rewards[optimal_arm]
            )
        }
    )
)

Now when I try to run this, I get the error message mentioned in the title:

horizon                           <- 10000L
simulations                       <- 1L

#                    S----M------------> Arm 1:   Sport
#                    |    |              Arm 2:   Movie
#                    |    |
weights <- matrix( c(0.4, 0.3,    #-----> Context: Male
                     0.8, 0.7),   #-----> Context: Female
                   
                   nrow = 2, ncol = 2, byrow = TRUE)

policy                            <- RandomPolicy$new()
bandit                            <- ContextualBernoulliBandit2$new(weights = weights, prob = c(0.7, 0.3))
Error in self$prob <- prob : cannot add bindings to a locked environment
 
I am not very familiar with R6 classes, which is the cause of this error. Any help would be appreciated!

Here's some info on my session:


sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed

Matrix products: default
BLAS/LAPACK: /home/cbrunos/miniconda3/envs/r_env/lib/R/lib/libRblas.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] contextual_0.9.8.2

loaded via a namespace (and not attached):
[1] codetools_0.2-16 foreach_1.4.7 R.methodsS3_1.7.1 R6_2.4.1 R.devices_2.16.1 itertools_0.1-3
[7] data.table_1.12.8 doParallel_1.0.15 R.oo_1.23.0 R.utils_2.9.2 Formula_1.2-3 rjson_0.2.20
[13] iterators_1.0.12 tools_3.6.1 parallel_3.6.1 compiler_3.6.1 base64enc_0.1-3


Save the trained agent and hold the thetas unchanged for simulation on a new dataset

Dear Robin,

This is not a bug report but more like a new feature request.

We know that the theta is updated after the agent's every interaction with the bandit. What I want to ask is whether it is possible to save the "trained" agent with its theta for later use on another dataset. The logic behind this is that the trained agent acts as an oracle/ground truth of the environment; I then want to add a benchmark full-information model based on this oracle. In this way, I can look at the maximum reward I could theoretically get if I initiate my offline evaluation with this oracle, without knowing the ground truth until the end of my simulation.

Basically, to achieve this goal, I need to save the trained agents with their thetas, break the theta-updating chain, and hold the thetas unchanged when they are used on another dataset.

Thank you so much for your help!

Best,
Han

abline() does not draw lines where expected

Thanks for your really quick fix last time. Here I'm having a problem with plotting. Using abline() does not draw a line where expected. Also, using grid() does not give the expected results.

Start R and run the example from ?EpsilonGreedyPolicy with the following:

library(contextual)

horizon     <- 100L
simulations <- 100L
weights     <- c(0.9, 0.1, 0.1)
policy      <- EpsilonGreedyPolicy$new(epsilon = 0.1)
bandit      <- BasicBernoulliBandit$new(weights = weights)
agent       <- Agent$new(policy, bandit)
history     <- Simulator$new(agent, horizon, simulations, do_parallel = FALSE)$run()
plot(history, type = "cumulative")
abline(h=0, col=3, lwd=4)
abline(h=1, col=3, lwd=4)
abline(v=0, col=3, lwd=4)
abline(v=1, col=3, lwd=4)
grid()

In the image below, the green lines show the locations of 0 and 1 according to abline. The misalignment with the axis labels makes it difficult to use abline on contextual plots. Also, the faint dashed lines of grid do not align with the axis ticks.

[screenshot omitted: the plot described above, with the green abline() reference lines misaligned with the axis labels]

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS release 6.10 (Final)
##
## other attached packages:
## [1] data.table_1.11.4  contextual_0.9.8.2

Example in ContextualBinaryBandit doc does not run

I very much like your contextual package! Thanks for sharing it.

Below is a description of how to reproduce this bug (possibly just a documentation bug).

Start R and run the example from ?ContextualBinaryBandit with the following:

library(contextual)
library(data.table)

horizon <- 100
sims    <- 100
policy  <- EpsilonGreedyPolicy$new(epsilon = 0.1)
bandit  <- ContextualBinaryBandit$new(weights = c(0.6, 0.1, 0.1))
agent   <- Agent$new(policy,bandit)
##  Error in rep(list(self$theta_to_arms[[param_index]]), k) : 
##    invalid 'times' argument
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS release 6.10 (Final)
##
## other attached packages:
## [1] data.table_1.11.4  contextual_0.9.8.2

Setting random seed outside of Simulator

Hello Robin, I wanted to point out the following as it confused me for a while.

I was trying to set a random seed outside of Simulator but my data was not being randomised properly.

Then I remembered that you mention in your documentation that calling Simulator sets a random seed so that the simulations are replicable.

Because Simulator resets the random seed each time it's called, naively doing the following:

horizon     <- 100L
simulations <- 100L
weights     <- c(0.9, 0.1, 0.1)
policy      <- EpsilonGreedyPolicy$new(epsilon = 0.1)
bandit      <- BasicBernoulliBandit$new(weights = weights)
agent       <- Agent$new(policy, bandit)

for (i in 1:2) {
  history <- Simulator$new(agent, horizon, simulations)$run()
  print(runif(1))
}

results in identical values from the two calls of runif(1).

I came up with the following simple solution which works in my case (it sets a unique seed from the loop index):

for (i in 1:2) {
  history <- Simulator$new(agent, horizon, simulations, set_seed= i)$run()
  print(runif(1))
}

However, I wondered if there was a way for Simulator to only set its own internal random seed and not reset it globally each time it is called.
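An editorial workaround, hedged and not part of the package: save the global RNG state before each run and restore it afterwards (reusing the objects defined above), so that the outer random stream is unaffected by Simulator's internal seeding:

set.seed(42)                       # make sure .Random.seed exists
for (i in 1:2) {
  outer_state <- .Random.seed      # remember the outer RNG state
  history <- Simulator$new(agent, horizon, simulations)$run()
  .Random.seed <- outer_state      # restore it, undoing Simulator's reset
  print(runif(1))                  # values now differ across iterations
}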

Confusing assignments in method get_reward for OfflineDoublyRobustBandit

I'm studying contextual bandits, so I have found this library very useful for understanding the algorithms involved. In particular, I was reviewing the implementation of OfflineDoublyRobustBandit, and I got confused about the handling of the "inverted" parameter in these lines:

        if (self$inverted) p <- 1 / p
        if (self$threshold > 0) {
          if (isTRUE(self$inverted))  p <- 1 / p
          p <- 1 / max(p,self$threshold)
        } else {
          if (!isTRUE(self$inverted)) p <- 1 / p
        }

I think the last line could be a bug (since it inverts p even if "inverted" is FALSE), and it seems like this could be done with something simpler like:

        if (self$threshold > 0) {
          p <- 1 / max(p,self$threshold)
        } else {
          if (isTRUE(self$inverted)) p <- 1 / p
        }

Could someone confirm whether or not this is a bug? And what is the logic behind that parameter? Thanks in advance.

Arm choice sequence from the simulation?

I am using the contextual package to run some simulations. More specifically, I am using the LinUCBDisjointOptimizedPolicy. Is there a way for me to get the arm choice sequence from the simulation?
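An editorial aside, hedged: the History object returned by Simulator logs per-step data that can be retrieved with history$get_data_table() (mentioned in a later issue below). Assuming that log contains a column with the chosen arm (the column name below is an assumption, not verified against the source), the sequence could be extracted along these lines:

# history is the object returned by Simulator$new(...)$run()
dt      <- history$get_data_table()   # per-step simulation log
choices <- dt$choice                  # assumed column holding the arm chosen at each step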

Some minor clarifications in the documentation

This is quite a minor point, but the documentation from ?LinUCBGeneralPolicy mentions "Algorithm 1 LinUCB" in the paper by Lihong Li et al. (2010), whereas the documentation from ?LinUCBDisjointPolicy does not. However, comparing Algorithm 1 from the paper with your code, it seems that LinUCBDisjointPolicy is exactly "Algorithm 1 LinUCB" from the paper, whereas LinUCBGeneralPolicy is similar but not the same (more like the other part of the LinUCB hybrid model that is not in LinUCBDisjointPolicy). If I'm right, I think it would be helpful for the documentation to state explicitly that LinUCBDisjointPolicy is "Algorithm 1 LinUCB" from Li's paper.

Also, the description in the documentation from ?LinUCBHybridPolicy refers to LinUCBHybridOptimizedPolicy. I guess that both LinUCBHybridPolicy and LinUCBHybridOptimizedPolicy are exactly "Algorithm 2 LinUCB with hybrid linear models" from Li's paper. Again, if I'm correct, I think it would be helpful to state this explicitly in the documentation.

Example of customized context-free bandits

I have offline data available in the following format:
time | reward_arm1 | reward_arm2 | reward_arm3
This is not a contextual bandit case, as there is only the reward of each arm at time t. I want to implement UCB1, UCB2 and other context-free algorithms on this data. I searched for a demo on context-free custom data; however, I could not find any. So I am creating a custom context-free bandit from one of the available demos.

@robinvanemden
For example, in your myocardial example there are two arms - treatment and no_treatment - and each has its own reward. The computed R1 and R2 in the example can serve as the rewards of the arms.

I tried to create a new dataset with just these two columns R1 and R2 as follows:

# Import myocardial infection dataset

url             <- "http://d1ie9wlkzugsxr.cloudfront.net/data_propensity/myocardial_propensity.csv"
data            <- fread(url)

simulations     <- 1
horizon         <- nrow(data)


data$trt        <- data$trt + 1


data$alive      <- abs(data$death - 1)


f                <- alive ~ age + risk + severity

model_f          <- function(arm) glm(f, data=data[trt==arm], family=binomial(link="logit"), y=F, model=F)
arms             <- sort(unique(data$trt))
model_arms       <- lapply(arms, FUN = model_f)

predict_arm      <- function(model) predict(model, data, type = "response")
r_data           <- lapply(model_arms, FUN = predict_arm)
r_data           <- do.call(cbind, r_data)
colnames(r_data) <- paste0("R", (1:max(arms)))
data             <- cbind(data,r_data)

# extracting only R1 and R2
data        <- data[,8:9]

I am trying to create a bandit out of this data and run the algorithms, but it does not work.

# New-changed formula

#f                <- alive ~ trt | age + risk + severity | R1 + R2
#2 
f                <- alive ~ R1 + R2

#bandit           <- OfflineDirectMethodBandit$new(formula = f, data = data)
bandit           <- OfflineDirectMethodBandit$new( data = data)

# Define agents.

#agents      <- list(Agent$new(LinUCBDisjointOptimizedPolicy$new(0.2), bandit, "LinUCB"))
agents      <- list(Agent$new(UCB1Policy$new(), bandit, "UCB"))


simulation  <- Simulator$new(agents = agents, simulations = simulations, horizon = horizon, do_parallel = FALSE)


sim  <- simulation$run()


plot(sim, type = "cumulative", regret = FALSE, rate = TRUE, legend_position = "bottomright")

In particular, the simulation step does not run. Can you please point me towards a minimal working example of running UCB1 or UCB2 on custom context-free data?

contextual do_parallel doesn't work on MRAN 3.5.1

Hi!
I am trying to run the following on my MRAN 3.5.1 installation:
history <- Simulator$new(agents = agent,
horizon = horizon,
simulations = 1000,do_parallel = T)$run()
but receive:
Setting up parallel backend.
Cores available: 4
Workers assigned: 3
Simulation horizon: 250
Number of simulations: 5000 # this also stays unchanged!
Number of batches: 3
Starting main loop.
Error in gp$globals[[match(s, syms)]] : subscript out of bounds

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)

Matrix products: default

locale:
[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251 LC_MONETARY=Russian_Russia.1251
[4] LC_NUMERIC=C LC_TIME=Russian_Russia.1251

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] itertools_0.1-3 iterators_1.0.10 data.table_1.11.4 contextual_0.9.8.3 RevoUtils_11.0.1
[6] RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
[1] codetools_0.2-15 listenv_0.7.0 future_1.9.0 withr_2.1.2 digest_0.6.15 foreach_1.5.0
[7] R.methodsS3_1.7.1 R6_2.3.0 R.devices_2.16.0 doParallel_1.0.13 R.oo_1.22.0 R.utils_2.6.0
[13] devtools_1.13.6 Formula_1.2-3 rjson_0.2.20 tools_3.5.1 yaml_2.2.0 parallel_3.5.1
[19] compiler_3.5.1 base64enc_0.1-3 globals_0.12.1 memoise_1.1.0

Contextual, determinism and setting seeds

By setting a new seed each time another round of simulation is run (if the number of simulations run is greater than 1), is the program reordering the input data? For example, if we have 10 data points, would they be ordered differently across different simulations? If not, what exactly is reassigning the seed doing in each round of simulation?

Save predicted reward for chosen arm (feature request)

Hello Robin,

This is a feature request, not a bug report.

I'd like the output from history$get_data_table() to include a column for the predicted values of the chosen arms at each step.

For example, for EpsilonGreedyPolicy it would just be self$theta$mean[[chosen_arm]], which I realise is available by setting save_theta = TRUE in Simulator$new. If I also set save_context = TRUE the predicted value of the chosen action can be obtained. (Although I have to take into account the fact that the theta values are one time step ahead of the values for the current context-arm pair since they have been updated with the reward from the current context-arm pair. That is, the theta values do not hold the predicted values for the current context-arm pair since they hold the values computed after the reward for the current context-arm pair is known.)
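(An editorial aside, hedged: a minimal sketch of the save_theta / save_context route described in the previous paragraph, with the argument names taken from this issue.)

# Log thetas and contexts alongside the usual simulation output.
history <- Simulator$new(agent, horizon, simulations,
                         save_theta = TRUE, save_context = TRUE)$run()
dt      <- history$get_data_table()   # per-step log, now including the saved theta and context values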

With other policies, such as ContextualEpsilonGreedyPolicy, using the output from history$get_data_table() to compute the expected reward for the current action before it is taken is not so straightforward. I see in policy_cmab_lin_epsilon_greedy.R that you compute expected_rewards[arm], but you don't seem to save the values for output later on. It is exactly expected_rewards[arm] that I would like history$get_data_table() to include in its output. Having expected_rewards[arm] for just the chosen arm would be enough for my current needs, but maybe having expected_rewards[arm] for all arms would be useful in future.

I had a look at history.R to see if I could work out how to save the values of expected_rewards, but it looks rather complicated to me and my R is nowhere near as good as yours :-).

Thanks,

Paul

Possible bug in Exp3Policy

I was reviewing the implementation of the Exp3 algorithm in this function, with the help of the formula posted in this link. I think the statement implementing the last part of the formula is misplaced, since it only updates the last element of probs and not all the elements (it seems like it should be inside the loop):

    get_action = function(t, context) {
      probs <- rep(0.0, context$k)
      for (i in 1:context$k) {
         probs[i] <- (1 - gamma) * (self$theta$weight[[i]] / sum_of(self$theta$weight))
      }
      inc(probs[i])  <- ((gamma) * (1.0 / context$k))  # <--------
      action$choice  <- categorical_draw(probs)
      action
    },

So I think it should be corrected as follows:

    get_action = function(t, context) {
      probs <- rep(0.0, context$k)
      for (i in 1:context$k) {
         probs[i] <- (1 - gamma) * (self$theta$weight[[i]] / sum_of(self$theta$weight))
         inc(probs[i])  <- ((gamma) * (1.0 / context$k))  # <-------
      }  
      action$choice  <- categorical_draw(probs)
      action
    },

Could you confirm this is a bug? Then I can make a simple PR to correct it.

Typo in documentation for ContextualEpochGreedyPolicy

In the documentation from ?ContextualEpochGreedyPolicy it says, under Usage,

policy <- EpsilonGreedyPolicy(epsilon = 0.1)

which is a different policy, that is, epsilon greedy instead of epoch greedy.

I think the correction would be

policy <- ContextualEpochGreedyPolicy$new(sZl = 0.1)
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS release 6.10 (Final)
##
## other attached packages:
## [1] data.table_1.11.4  contextual_0.9.8.2

Forthcoming release of ggplot2 and contextual

We are contacting you because you are the maintainer of contextual, which imports ggplot2 and uses vdiffr to manage visual test cases. The upcoming release of ggplot2 includes several improvements to plot rendering, including the ability to specify lineend and linejoin in geom_rect() and geom_tile(), and improved rendering of text. These improvements will result in subtle changes to your vdiffr doppelgangers when the new version is released.

Because vdiffr test cases do not run on CRAN by default, your CRAN checks will still pass. However, we suggest updating your visual test cases with the new version of ggplot2 as soon as possible to avoid confusion. You can install the development version of ggplot2 using remotes::install_github("tidyverse/ggplot2").

If you have any questions, let me know!

How to get predictions of new test data?

First, thank you for the awesome package.
I'm wondering how to get predictions for new test data.
Let's say that I build an online advertisement recommendation system using a multi-armed bandit. I build a simulator and history for certain data (Replay Evaluator bandit, Linear UCB policy with 100-dimensional context vectors). I want to see what the optimal output would be (which advertisement should be presented) for my test data.
I tried the "predict" function with the simulator and with the agents, and neither worked. I really look forward to your help.
Thanks!

invalid argument 'k' in Policy$initialize_theta(k)

The argument 'k' in the initialize_theta(k) function in 'policy.R' is invalid.

This leads to the following error when creating an Agent from any Policy:

Error in rep(list(self$theta_to_arms[[param_index]]), k) :
invalid 'times' argument

cannot have more than two simulations per epoch using benchmark MAB policy in offline bandit CMAB policy evaluation

Dear developers,

Thank you for developing and maintaining this CMAB package. I benefit a lot from it in my own research.

I encountered a problem recently. I use an offline bandit and CMAB policies (ConLinTS and ConLinUCB) for offline bandit evaluation. I also include MAB policies (UCB1 and TS) as benchmarks in the same agent definition, together with the CMAB policies, for the simulation.

Here is my code:

f2 <- DV~ arm| covariates| r.1...|p
bandit <- OfflineDoublyRobustBandit$new(formula = f2, data = data, randomize = FALSE)
agents <- list(Agent$new(LinUCBDisjointOptimizedPolicy$new(1), bandit, "LinUCB"),
Agent$new(ContextualLinTSPolicy$new(v=0.2), bandit, "ConLinTS"),
Agent$new(EpsilonGreedyPolicy$new(epsilon = 0.5), bandit, "EGreedy"),
Agent$new(UCB1Policy$new(), bandit, "UCB1"),
Agent$new(ThompsonSamplingPolicy$new(1,1), bandit, "TS"),
Agent$new(RandomPolicy$new(), bandit, "Random"))

simulation <- Simulator$new(agents = agents, simulations = 1, horizon = 30000, save_context = TRUE, worker_max = 32)

It runs well if the number of simulations is set to 1 or 2. However, if I try to do more than 2 simulations, I get the error "Error in { : task 1 failed - "missing value where TRUE/FALSE needed"" after it starts the main loop.

I did some debugging and found that this problem may occur only for the MAB policies. I can run 10 simulations if I use the CMAB and Random policies separately, but I can only run 2 simulations for the MAB policies; more than 2 simulations produce the same error, together with the following warnings:

In addition: Warning messages:
1: In for (v in val) { :
closing unused connection 5 (<-kubernetes.docker.internal:11190)
2: In for (v in val) { :
closing unused connection 4 (<-kubernetes.docker.internal:11190)
3: In for (v in val) { :
closing unused connection 3 (<-kubernetes.docker.internal:11190)

Do you have any idea what is happening?

Thank you so much!

Best,
Han
