opendilab / awesome-exploration-rl


A curated list of awesome exploration RL resources (continually updated)

License: Apache License 2.0

exploration-exploitation reinforcement-learning awesome-list hard-exploration awesome delayed-rewards exploration exploratory reinforcement-learning-algorithms sparse-reward-algorithms


Awesome Exploration Methods in Reinforcement Learning

Updated on 2024.06.12

  • Here is a collection of research papers on exploration methods in Reinforcement Learning (ERL). The repository will be continuously updated to track the frontier of ERL. Welcome to follow and star!

  • The balance between exploration and exploitation is one of the most central problems in reinforcement learning. To give readers an intuitive feel for exploration, we show a typical hard-exploration environment from MiniGrid below. In this task, reaching the goal often requires dozens or even hundreds of steps, and the agent must sufficiently explore different parts of the state-action space to learn the skills the goal demands.

A typical hard-exploration environment: MiniGrid-ObstructedMaze-Full-v0.


A Taxonomy of Exploration RL Methods


In general, we can divide the reinforcement learning process into two phases: the collect phase and the train phase. In the collect phase, the agent chooses actions according to its current policy and interacts with the environment to gather useful experience. In the train phase, the agent uses the collected experience to update the policy and obtain a better-performing one.
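The two-phase loop above can be sketched in a few lines. This is a minimal, hypothetical illustration (a tabular Q-learning agent on a toy five-state chain, not code from any repository listed here):

```python
import random

# Hypothetical toy chain: states 0..4, action 1 moves right, action 0 moves left.
# Reward 1.0 is given for arriving at the rightmost state.
N_STATES, N_ACTIONS = 5, 2

def env_step(state, action):
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def collect_phase(policy, num_steps):
    """Collect phase: act with the current policy and store transitions."""
    buffer, state = [], 0
    for _ in range(num_steps):
        action = policy(state)
        next_state, reward = env_step(state, action)
        buffer.append((state, action, reward, next_state))
        state = next_state
    return buffer

def train_phase(q_table, buffer, lr=0.1, gamma=0.99):
    """Train phase: one-step Q-learning update from the collected experience."""
    for state, action, reward, next_state in buffer:
        target = reward + gamma * max(q_table[next_state])
        q_table[state][action] += lr * (target - q_table[state][action])

random.seed(0)
q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
for _ in range(200):  # alternate the collect and train phases
    batch = collect_phase(lambda s: random.randrange(N_ACTIONS), num_steps=20)
    train_phase(q, batch)
```

A purely random collection policy still stumbles onto the reward in this tiny chain; the hard-exploration settings this list focuses on are precisely those where such naive collection fails.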

According to the phase in which the exploration component is explicitly applied, we divide Exploration RL methods into two main categories: Augmented Collecting Strategy and Augmented Training Strategy:

  • Augmented Collecting Strategy covers the exploration strategies commonly applied in the collect phase, which we further divide into four categories:

    • Action Selection Perturbation
    • Action Selection Guidance
    • State Selection Guidance
    • Parameter Space Perturbation
  • Augmented Training Strategy covers the exploration strategies commonly applied in the train phase, which we further divide into seven categories:

    • Count Based
    • Prediction Based
    • Information Theory Based
    • Entropy Augmented
    • Bayesian Posterior Based
    • Goal Based
    • (Expert) Demo Data

Note that these categories may overlap, and an algorithm may belong to several of them. For more detailed surveys on exploration methods in RL, you can refer to Tianpei Yang et al. and Susan Amin et al.
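As a concrete instance of the simplest Action Selection Perturbation strategy, here is an ε-greedy sketch (a generic textbook technique, not taken from any single paper above): with probability ε the agent acts randomly, otherwise it acts greedily with respect to its value estimates.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Action Selection Perturbation: random action with prob. epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

random.seed(1)
# With epsilon=0.3, the greedy action (index 1) is still picked about 80% of the time.
actions = [epsilon_greedy([0.2, 0.9, 0.1], epsilon=0.3) for _ in range(1000)]
```

Even a small ε keeps every action being sampled forever, which is the baseline behavior many of the methods in this list improve upon in hard-exploration tasks.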


A non-exhaustive but useful taxonomy of methods in Exploration RL. We provide example methods for each category, shown in the blue areas above.

Here are the links to the papers that appeared in the taxonomy:

[1] Go-Explore: Adrien Ecoffet et al., 2021
[2] NoisyNet: Meire Fortunato et al., 2018
[3] DQN-PixelCNN: Marc G. Bellemare et al., 2016
[4] #Exploration: Haoran Tang et al., 2017
[5] EX2: Justin Fu et al., 2017
[6] ICM: Deepak Pathak et al., 2018
[7] RND: Yuri Burda et al., 2018
[8] NGU: Adrià Puigdomènech Badia et al., 2020
[9] Agent57: Adrià Puigdomènech Badia et al., 2020
[10] VIME: Rein Houthooft et al., 2016
[11] EMI: Wang et al., 2019
[12] DIAYN: Benjamin Eysenbach et al., 2019
[13] SAC: Tuomas Haarnoja et al., 2018
[14] BootstrappedDQN: Ian Osband et al., 2016
[15] PSRL: Ian Osband et al., 2013
[16] HER: Marcin Andrychowicz et al., 2017
[17] DQfD: Todd Hester et al., 2018
[18] R2D3: Caglar Gulcehre et al., 2019
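To make the Count Based branch of the taxonomy concrete, here is a simplified tabular count bonus in the spirit of #Exploration [4] (the real method hashes continuous states before counting; the hashing is omitted here): the intrinsic reward β/√N(s) is added to the environment reward and decays as a state becomes familiar.

```python
import math
from collections import defaultdict

class CountBonus:
    """Count-based intrinsic reward: beta / sqrt(N(s)), shrinking with each visit."""

    def __init__(self, beta=0.5):
        self.beta = beta
        self.counts = defaultdict(int)  # visit counter N(s) per state

    def reward(self, state):
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

bonus = CountBonus(beta=0.5)
first = bonus.reward("s0")       # novel state -> large bonus (0.5 / sqrt(1))
for _ in range(98):
    bonus.reward("s0")
hundredth = bonus.reward("s0")   # familiar state -> small bonus (0.5 / sqrt(100))
```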

Papers

format:
- [title](paper link) (presentation type, openreview score [if the score is public])
  - author1, author2, author3, ...
  - Key: key problems and insights
  - ExpEnv: experiment environments

ICLR 2024


NeurIPS 2023


ICML 2023


ICLR 2023


NeurIPS 2022


ICML 2022


ICLR 2022


NeurIPS 2021


Classic Exploration RL Papers


Contributing

Our purpose is to provide a starting paper guide for those who are interested in exploration methods in RL. If you are interested in contributing, please refer to HERE for contribution instructions.

License

Awesome Exploration RL is released under the Apache 2.0 license.

