Judgmentally adjusted Q-values based on Q-ensemble for offline reinforcement learning

Wenzhuo Liu, Shuying Xiang, Tao Zhang, Yanan Han, Xingxing Guo, Yahui Zhang, Yue Hao
DOI: https://doi.org/10.1007/s00521-024-09839-z
2024-08-25
Neural Computing and Applications
Abstract: Recent advances in offline reinforcement learning (offline RL) have leveraged the Q-ensemble approach to derive optimal policies from previously collected static datasets. Increasing the batch size can replace part of the penalty that Q-ensemble members impose on out-of-distribution (OOD) data, which significantly reduces the Q-ensemble size while maintaining comparable performance and speeding up training. To further strengthen the Q-ensemble's ability to penalize OOD data, this work combines large-batch punishment with a binary classification network that distinguishes in-distribution (ID) data from OOD data. For ID data, Q-values are adjusted upward (reward-based adjustment); for OOD data, Q-values are adjusted downward (penalty-based adjustment). The penalty-based adjustment takes over part of the OOD penalty otherwise supplied by a large Q-ensemble, reducing the ensemble size without compromising performance. One of the two adjustments is selected for each task on the D4RL benchmark datasets. Experimental results show that reward-based adjustment improves algorithm performance, while penalty-based adjustment reduces the Q-ensemble size without compromising performance. Compared with LB-SAC, the approach reduces average convergence time by 38% on the datasets that use penalty-based adjustment, owing to the simpler binary classification network and the smaller number of Q networks.
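The adjustment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `adjusted_q_value`, the parameters `threshold` and `scale`, the use of the classifier probability as the adjustment magnitude, and the minimum over ensemble members as the pessimistic base estimate are all assumptions made for illustration. Only the overall idea (classify a state-action pair as ID or OOD, then raise or lower its Q-value, using one mode per task) comes from the abstract.

```python
from typing import Sequence


def adjusted_q_value(
    q_values: Sequence[float],
    p_in_distribution: float,
    mode: str = "penalty",
    threshold: float = 0.5,
    scale: float = 1.0,
) -> float:
    """Hypothetical sketch of a judgmentally adjusted Q-value.

    q_values:          Q estimates from each ensemble member for one (s, a) pair.
    p_in_distribution: output of a binary classifier, read here as the
                       probability that (s, a) is in-distribution (assumption).
    mode:              "reward" adjusts only ID samples upward;
                       "penalty" adjusts only OOD samples downward.
    threshold, scale:  illustrative hyperparameters, not taken from the paper.
    """
    if not q_values:
        raise ValueError("q_values must contain at least one estimate")
    if mode not in ("reward", "penalty"):
        raise ValueError("mode must be 'reward' or 'penalty'")

    # Pessimistic base estimate: minimum over the ensemble (assumed here).
    q_min = min(q_values)
    is_in_distribution = p_in_distribution >= threshold

    if mode == "reward" and is_in_distribution:
        # Reward-based adjustment: raise the Q-value for ID data.
        return q_min + scale * p_in_distribution
    if mode == "penalty" and not is_in_distribution:
        # Penalty-based adjustment: lower the Q-value for OOD data.
        return q_min - scale * (1.0 - p_in_distribution)
    # Samples not targeted by the selected mode are left unchanged.
    return q_min
```

Under these assumptions, calling `adjusted_q_value([1.0, 2.0, 3.0], 0.75, mode="reward")` raises the base estimate of 1.0 for a sample judged in-distribution, while `adjusted_q_value([1.0, 2.0, 3.0], 0.25, mode="penalty")` lowers it for a sample judged out-of-distribution. Because the classifier supplies part of the OOD penalty, fewer ensemble members would be needed to obtain a similarly pessimistic estimate, which is the trade-off the abstract describes.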
computer science, artificial intelligence