An OCBA-Based Method for Efficient Sample Collection in Reinforcement Learning
Kuo Li, Xinze Jin, Qing-Shan Jia, Dongchun Ren, Huaxia Xia
DOI: https://doi.org/10.1109/tase.2023.3282257
IF: 6.636
2023-01-01
IEEE Transactions on Automation Science and Engineering
Abstract: This work focuses on sample collection in reinforcement learning (RL), where interaction with the environment is typically time-consuming and expensive. To collect more valuable samples, we propose a confidence-based sampling strategy built on the optimal computing budget allocation (OCBA) algorithm, which actively allocates computing effort among actions according to their predictive uncertainties. We estimate the uncertainty with ensembles and generalize them from tabular representations to function approximation. The OCBA-based sampling strategy can be easily integrated into various off-policy RL algorithms; we take Q-learning, DQN, and SAC as examples to show the integration. We also provide a theoretical analysis of convergence and evaluate the algorithms experimentally. In the experiments, the resulting algorithms obtain remarkable gains over modern ensemble-based RL algorithms.
Note to Practitioners: Reinforcement learning is a powerful tool for sequential decision-making problems, e.g., autonomous driving and robotic control, where actions typically have long-term effects on future events. Although RL achieves human-level control in some tasks, it suffers severely from low sample efficiency. Applying RL in some practical areas, e.g., healthcare and rescue, is therefore extremely hard because of the massive number of samples required. This work aims to enhance exploration in RL by incorporating OCBA, which provides an asymptotically optimal data-collection strategy for simulation-based optimization. Based on ensemble-based uncertainty estimation and OCBA-based action selection, the resulting RL algorithms show competitive performance on many benchmarks and significantly reduce the sampling effort during iterations.
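The abstract names two ingredients, ensemble-based uncertainty estimation and OCBA-based action selection, without giving formulas. The sketch below illustrates how they might fit together in the tabular, discrete-action case. It uses the classical OCBA allocation rule from the simulation-optimization literature, which may differ from the exact variant in the paper. The function names (ensemble_stats, ocba_ratios, select_action), the use of the ensemble's sample standard deviation as the uncertainty measure, and the choice to sample actions in proportion to the OCBA ratios are all assumptions made for illustration.

```python
import math
import random


def ensemble_stats(q_values):
    """Per-action mean and sample std over an ensemble.

    q_values[m][a] is the value estimate of action a from ensemble member m
    (hypothetical layout, assumed for this sketch).
    """
    n_members = len(q_values)
    n_actions = len(q_values[0])
    means = [sum(q[a] for q in q_values) / n_members for a in range(n_actions)]
    stds = [
        math.sqrt(
            sum((q[a] - means[a]) ** 2 for q in q_values) / max(n_members - 1, 1)
        )
        for a in range(n_actions)
    ]
    return means, stds


def ocba_ratios(means, stds, eps=1e-8):
    """Classical OCBA allocation fractions for identifying the max-mean action.

    Non-best action i gets weight (sigma_i / delta_i)^2, where delta_i is its
    gap to the best mean; the best action b gets
    sigma_b * sqrt(sum_{i != b} (w_i / sigma_i)^2).
    eps guards against zero gaps and zero stds (an assumption of this sketch).
    """
    k = len(means)
    best = max(range(k), key=lambda i: means[i])
    weights = [0.0] * k
    for i in range(k):
        if i == best:
            continue
        delta = max(means[best] - means[i], eps)
        sigma = max(stds[i], eps)
        weights[i] = (sigma / delta) ** 2
    sigma_b = max(stds[best], eps)
    weights[best] = sigma_b * math.sqrt(
        sum((weights[i] / max(stds[i], eps)) ** 2 for i in range(k) if i != best)
    )
    total = sum(weights)
    if total == 0.0:  # single-action case: nothing to allocate
        return [1.0 / k] * k
    return [w / total for w in weights]


def select_action(q_values, rng=random):
    """Sample an action with probability equal to its OCBA fraction."""
    means, stds = ensemble_stats(q_values)
    probs = ocba_ratios(means, stds)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Under this rule, actions whose estimated value is close to the current best and whose ensemble disagreement is large receive more of the sampling budget, while clearly inferior actions with tight estimates are rarely tried. Extending the idea to function approximation (DQN, SAC) would require an ensemble of networks and, for continuous actions, a different way of enumerating candidates; neither is shown here.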