RNN (LSTM or GRU)
- predicts the opponent's policy distribution from the sequence of previous moves
- difficult to learn all possible opponent strategies (deeper network?)
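A minimal sketch of the idea: an untrained toy Elman RNN (pure Python, random weights) that maps a move history to a distribution over 3 moves. A real implementation would be a trained LSTM/GRU in a deep-learning framework; this only shows the sequence-to-distribution shape. All names here (`TinyRNN`, `predict`) are hypothetical.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class TinyRNN:
    """Untrained Elman RNN: move history -> distribution over n_moves."""
    def __init__(self, n_moves=3, hidden=8, seed=0):
        rng = random.Random(seed)
        w = lambda r, c: [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
        self.Wx = w(hidden, n_moves)   # input -> hidden
        self.Wh = w(hidden, hidden)    # hidden -> hidden (recurrence)
        self.Wo = w(n_moves, hidden)   # hidden -> output logits
        self.n_moves, self.hidden = n_moves, hidden

    def predict(self, history):
        h = [0.0] * self.hidden
        for move in history:
            x = [1.0 if i == move else 0.0 for i in range(self.n_moves)]
            h = [math.tanh(sum(self.Wx[j][i] * x[i] for i in range(self.n_moves))
                           + sum(self.Wh[j][k] * h[k] for k in range(self.hidden)))
                 for j in range(self.hidden)]
        return softmax([sum(self.Wo[o][j] * h[j] for j in range(self.hidden))
                        for o in range(self.n_moves)])
```

Usage: `TinyRNN().predict([0, 1, 2, 0])` returns a length-3 probability vector over the opponent's next move.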
UCB (Upper Confidence Bound)1
- balance exploitation/exploration
- predictable by opponent
- needs high exploration constant and/or frequent resets
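The UCB1 selection rule from reference 1, as a minimal sketch (the exploration constant `c` is exposed as a parameter, since, per the note above, an adversarial setting tends to need it large):

```python
import math

def ucb1_select(counts, values, t, c=2.0):
    """Pick the arm maximizing mean reward + exploration bonus.

    counts[i]: times arm i was played, values[i]: its empirical mean reward,
    t: total plays so far, c: exploration constant.
    """
    # Play any never-tried arm first
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: values[i] + math.sqrt(c * math.log(t) / counts[i]))
```

With counts `[10, 10, 1]` and near-equal means, the bonus term dominates and the under-sampled third arm gets chosen, which is the exploitation/exploration balance the note refers to.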
PUCB (Predictor + Upper Confidence Bound)2
- predictor (ideally) detects changes in opponent strategy
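A simplified sketch of predictor-weighted selection, closer to the PUCT rule popularized by AlphaZero-style search than to Rosin's exact PUCB formula in reference 2: the predictor supplies a prior over moves that scales the exploration bonus. The function name and constant are assumptions.

```python
import math

def puct_select(counts, values, priors, c=1.5):
    """Pick the arm maximizing Q + c * prior * sqrt(total) / (1 + n).

    priors[i] comes from the predictor (e.g. a model of the opponent);
    a confident predictor steers exploration toward its suggested moves.
    """
    total = sum(counts)
    return max(range(len(counts)),
               key=lambda i: values[i] + c * priors[i] * math.sqrt(total) / (1 + counts[i]))
```

With equal counts and values, the arm with the highest prior wins, so a predictor that notices an opponent-strategy change can immediately redirect play.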
SER4 (Successive Elimination Rounds with Randomized Round-Robin and Resets)3
- runs several randomized trials to find the move with the highest mean reward
- assumes constant opponent distribution
- bad against high variance strategies
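A sketch of one elimination round in the spirit of successive elimination; the randomized round-robin scheduling and the random resets that give SER4 its name are omitted, and `elimination_round` and its confidence-radius choice are assumptions, not the paper's exact procedure. Arms whose upper confidence bound falls below the best arm's lower bound are dropped.

```python
import math

def elimination_round(arms, pulls, delta=0.05):
    """arms: {name: reward_fn}; pull each arm `pulls` times, keep survivors.

    Uses a Hoeffding-style radius sqrt(ln(2/delta) / (2 * pulls)).
    """
    means = {name: sum(fn() for _ in range(pulls)) / pulls
             for name, fn in arms.items()}
    radius = math.sqrt(math.log(2 / delta) / (2 * pulls))
    best_lcb = max(means.values()) - radius
    # An arm survives if its upper bound still overlaps the leader's lower bound
    return {name: arms[name] for name in arms if means[name] + radius >= best_lcb}
```

Because each round trusts the sampled means, a high-variance opponent strategy can eliminate the truly best move, which is the weakness noted above.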
EXP3.R (EXP3 with Resets)3 4
- updates arm probabilities via exponentially weighted reward estimates mixed with a uniform exploration term
- reset based on detection of maximum mean reward drift
- good against exploitation-biased strategies
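The core EXP3 update, as a minimal sketch (the drift-detection reset that distinguishes EXP3.R is omitted; it would compare recent mean-reward estimates against a drift threshold and reinitialize the weights when exceeded). Function names are assumptions.

```python
import math

def exp3_probs(weights, gamma):
    """Mix normalized exponential weights with uniform exploration gamma."""
    total = sum(weights)
    k = len(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]

def exp3_update(weights, probs, arm, reward, gamma):
    """Importance-weight the observed reward, then boost the played arm."""
    k = len(weights)
    xhat = reward / probs[arm]           # unbiased estimate of arm's reward
    weights[arm] *= math.exp(gamma * xhat / k)
```

The importance weighting (`reward / probs[arm]`) is what lets EXP3 punish an opponent that exploits a predictable pattern: rarely played arms that pay off get large weight boosts.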
Bayesian (Thompson Sampling)5
- uses a Beta distribution to model each move's reward probability, updated from observed wins/losses
- assumes constant opponent distribution
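A minimal Beta-Bernoulli Thompson sampling sketch: draw one sample per arm from its posterior and play the argmax. The uniform Beta(1, 1) prior is an assumption.

```python
import random

def thompson_select(successes, failures):
    """Sample each arm's win rate from Beta(s + 1, f + 1), play the best draw."""
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])
```

As the note says, the posterior concentrates under a stationary opponent; against a shifting opponent the accumulated counts make it slow to adapt unless old observations are discounted or reset.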
1: https://link.springer.com/article/10.1023/A:1013689704352
2: https://link.springer.com/article/10.1007%2Fs10472-011-9258-6
3: https://link.springer.com/article/10.1007/s41060-017-0050-5