Deep Reinforcement Learning for Bandit Arm Localization.

Wenbin Du, Huaqing Jin, Chao Yu, Guosheng Yin
DOI: https://doi.org/10.1109/bigdata55660.2022.10020647
IF: 4.426
2022-01-01
Big Data
Abstract: In the multi-armed bandit (MAB) framework, we investigate the problem of learning the means of distributions that are associated with a finite number of arms under a monotonic constraint. Unlike the traditional MAB, our problem involves a parameter constraint and a limited trial budget (i.e., the number of arm pulls is small). However, the number of training samples can be made arbitrarily large through simulation, while each training sample is of limited size. This situation arises when some additional information is provided before the trial starts and each arm pull (or test) can be extraordinarily costly. For example, in cancer dose-finding clinical trials, higher toxicity probabilities are typically associated with higher dose levels (i.e., the monotonic dose–toxicity constraint), and the loss due to the drug’s toxicity, side effects, or death of patients can be enormous. We formulate this problem in the reinforcement learning (RL) paradigm and refer to it as a bandit arm localization problem. We propose a novel approach in a double deep Q-learning framework, which is integrated with a state-of-the-art statistical model to preserve the parameter constraint and develop a more effective learning strategy. The double deep Q-learning model can be trained on an arbitrarily large number of simulated trials, and this is the first work to cast dose finding in the RL framework. We evaluate the performance of our approach through extensive simulation studies in realistic settings of phase I clinical trials. The proposed double deep Q-learning approach is shown to outperform the baseline methods in cancer dose-finding trials.
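For readers unfamiliar with double deep Q-learning, the sketch below shows the generic double Q-learning target: the online network selects the next action and the target network evaluates it. It is a minimal illustration of the standard technique, not the authors' implementation. The function name `double_q_targets`, its parameters, and the use of plain NumPy arrays in place of network outputs are assumptions made here for illustration, and the statistical model and monotonic constraint described in the abstract are not modeled.

```python
import numpy as np


def double_q_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Compute double Q-learning targets for a batch of transitions.

    Hypothetical helper for illustration only.
    rewards:        shape (batch,), immediate rewards
    next_q_online:  shape (batch, n_actions), online network's Q(s', .)
    next_q_target:  shape (batch, n_actions), target network's Q(s', .)
    dones:          shape (batch,), 1.0 where the episode ended at s'
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_online = np.asarray(next_q_online, dtype=float)
    next_q_target = np.asarray(next_q_target, dtype=float)
    dones = np.asarray(dones, dtype=float)

    # Action selection uses the online network ...
    best_actions = np.argmax(next_q_online, axis=1)
    # ... while action evaluation uses the target network.
    batch_index = np.arange(best_actions.shape[0])
    evaluated = next_q_target[batch_index, best_actions]

    # Terminal transitions contribute no bootstrapped value.
    return rewards + gamma * (1.0 - dones) * evaluated


targets = double_q_targets(
    rewards=[1.0, 0.0],
    next_q_online=[[0.2, 0.9], [0.5, 0.1]],
    next_q_target=[[1.0, 2.0], [3.0, 4.0]],
    dones=[0.0, 1.0],
    gamma=0.5,
)
```

Separating selection from evaluation is what distinguishes double Q-learning from the single-network target, which takes the maximum over the same estimates it evaluates and tends to overestimate action values.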