Q-value Regularized Transformer for Offline Reinforcement Learning

Shengchao Hu, Ziqing Fan, Chaoqin Huang, Li Shen, Ya Zhang, Yanfeng Wang, Dacheng Tao
DOI: https://doi.org/10.48550/arxiv.2405.17098
2024-01-01
Abstract: Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Conditional Sequence Modeling (CSM), a paradigm that learns the action distribution based on history trajectory and target returns for each state. However, these methods often struggle with stitching together optimal trajectories from sub-optimal ones due to the inconsistency between the sampled returns within individual trajectories and the optimal returns across multiple trajectories. Fortunately, Dynamic Programming (DP) methods offer a solution by leveraging a value function to approximate optimal future returns for each state, while these techniques are prone to unstable learning behaviors, particularly in long-horizon and sparse-reward scenarios. Building upon these insights, we propose the Q-value regularized Transformer (QT), which combines the trajectory modeling ability of the Transformer with the predictability of optimal future returns from DP methods. QT learns an action-value function and integrates a term maximizing action-values into the training loss of CSM, which aims to seek optimal actions that align closely with the behavior policy. Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods, highlighting the potential of QT to enhance the state-of-the-art in offline RL.
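The combined objective described in the abstract can be illustrated with a minimal sketch. The abstract states only that a term maximizing action-values is added to the CSM training loss; the mean-squared-error form of the CSM term, the weighting coefficient `eta`, the toy `q_fn`, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def behavior_cloning_loss(predicted: Sequence[Vector], dataset: Sequence[Vector]) -> float:
    """CSM-style term (assumed MSE): keeps predicted actions close to dataset actions."""
    total, count = 0.0, 0
    for pred_action, data_action in zip(predicted, dataset):
        for p, d in zip(pred_action, data_action):
            total += (p - d) ** 2
            count += 1
    return total / count


def q_regularized_loss(
    states: Sequence[Vector],
    predicted: Sequence[Vector],
    dataset: Sequence[Vector],
    q_fn: Callable[[Vector, Vector], float],
    eta: float = 1.0,  # hypothetical trade-off weight, not taken from the abstract
) -> float:
    """Behavior-cloning loss minus a weighted mean Q-value of the predicted actions.

    Minimizing this pulls actions toward the data while pushing them toward
    higher estimated action-values.
    """
    bc = behavior_cloning_loss(predicted, dataset)
    mean_q = sum(q_fn(s, a) for s, a in zip(states, predicted)) / len(states)
    return bc - eta * mean_q


def toy_q(state: Vector, action: Vector) -> float:
    """Toy critic (assumption): action-value peaks when every action component is 1.0."""
    return -sum((a - 1.0) ** 2 for a in action)


states = [[0.0], [0.0]]
dataset_actions = [[0.0], [0.0]]    # sub-optimal behavior in the data
predicted_actions = [[0.5], [0.5]]  # policy output between the data and the Q optimum

loss = q_regularized_loss(states, predicted_actions, dataset_actions, toy_q, eta=1.0)
```

In this toy setup, setting `eta=0.0` recovers the pure behavior-cloning loss, while larger values weight the action-value term more heavily relative to staying close to the behavior policy.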