Abstract: When a reinforcement learning (RL) method has to decide between several optional policies solely by looking at the received reward, it has to implicitly optimize a Multi-Armed Bandit (MAB) problem. This raises the question: are current RL algorithms capable of solving MAB problems? We claim that the surprising answer is no. In our experiments we show that in some situations they fail to solve a basic MAB problem, and in many common situations they struggle: they suffer from regression in results during training, sensitivity to initialization, and high sample complexity. We claim that this stems from variance differences between policies, which cause two problems. The first is the "Boring Policy Trap": each policy has a different implicit exploration that depends on its reward variance, and leaving a boring (low-variance) policy is less likely because of its low implicit exploration. The second is the "Manipulative Consultant" problem: value-estimation functions used in deep RL algorithms such as DQN or deep Actor-Critic methods maximize estimation precision rather than mean reward, and they achieve a better loss on low-variance policies, which causes the network to converge to a sub-optimal policy. Cognitive experiments on humans have shown that noisy reward signals may paradoxically improve performance. We explain this using the aforementioned problems, claiming that humans and algorithms may share similar challenges in decision making. Inspired by this result, we propose the Adaptive Symmetric Reward Noising (ASRN) method, which equalizes the reward variance across different policies, thus avoiding the two problems without affecting the environment's mean reward behavior. We demonstrate that the ASRN scheme can dramatically improve the results.
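
The variance-equalizing idea can be illustrated with a minimal sketch. It is not the paper's ASRN algorithm: the two arms, their means and standard deviations, and the helper names `pull`, `noised_pull` and `summarize` are all hypothetical, and the standard deviations are assumed to be known in advance, whereas an adaptive scheme would have to estimate them during training. The sketch only shows that adding zero-mean (symmetric) noise to a low-variance arm can raise its reward variance to match the noisier arm while leaving its mean reward unchanged.

```python
import random
import statistics

# Hypothetical arms (not taken from the paper): arm -> (mean, standard deviation).
ARMS = {
    0: (0.0, 0.1),  # "boring" arm: low variance, lower mean reward
    1: (0.5, 2.0),  # noisy arm: high variance, higher mean reward
}


def pull(arm, rng):
    """Sample a raw reward from the chosen arm."""
    mean, std = ARMS[arm]
    return rng.gauss(mean, std)


def noised_pull(arm, rng):
    """Sample a reward and add zero-mean noise so every arm has the same variance.

    Assumption: the per-arm standard deviations are known. The added noise is
    symmetric around zero, so the expected reward of each arm is unchanged.
    """
    _, std = ARMS[arm]
    target_std = max(s for _, s in ARMS.values())
    # Independent variances add, so the missing variance is the difference.
    extra_std = max(target_std**2 - std**2, 0.0) ** 0.5
    return pull(arm, rng) + rng.gauss(0.0, extra_std)


def summarize(sampler, arm, n=20000, seed=0):
    """Return the empirical (mean, standard deviation) of n sampled rewards."""
    rng = random.Random(seed)
    rewards = [sampler(arm, rng) for _ in range(n)]
    return statistics.fmean(rewards), statistics.pstdev(rewards)
```

Calling `summarize(pull, 0)` and `summarize(noised_pull, 0)` side by side should show a similar empirical mean with a much larger spread in the second case, which is the property the abstract attributes to symmetric reward noising.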

Deep Reinforcement Learning for Bandit Arm Localization

The Bandit Whisperer: Communication Learning for Restless Bandits

Quantum Reinforcement Learning for Multi-Armed Bandits

A Deep Bayesian Bandits Approach for Anticancer Therapy: Exploration via Functional Prior

Asynchronous Localization for Underwater Acoustic Sensor Networks: A Continuous Control Deep Reinforcement Learning Approach

Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints

Deep reinforcement learning and its applications in medical imaging and radiation therapy: a survey

Deep Reinforcement Learning for Dynamic Treatment Regimes on Medical Registry Data

Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach

Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond

Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback

Scalable Decision-Focused Learning in Restless Multi-Armed Bandits with Application to Maternal and Child Health

Combinatorial Multi-Armed Bandit: General Framework and Applications

Best Arm Identification in Batched Multi-armed Bandit Problems

Multimodal Deep Reinforcement Learning with Auxiliary Task for Obstacle Avoidance of Indoor Mobile Robot

Kernel-Based Distributed Q-Learning: A Scalable Reinforcement Learning Approach for Dynamic Treatment Regimes

Can Q-learning solve Multi Armed Bantids?

Multiarmed Bandits Problem Under the Mean-Variance Setting

Planning and Learning in Risk-Aware Restless Multi-Arm Bandit Problem

Unified Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL

Learning Visual Tracking and Reaching with Deep Reinforcement Learning on a UR10e Robotic Arm