Conservative In-Distribution Q-Learning for Offline Reinforcement Learning

Zhengdao Shao, Liansheng Zhuang, Jie Yan, Liting Chen
DOI: https://doi.org/10.1109/ijcnn60899.2024.10650768
2024-01-01
Abstract: Offline Reinforcement Learning (RL) aims to learn policies from pre-collected datasets without any additional interaction. To perform well and robustly in dynamic environments with noise or disturbances, the learned value function and derived policy should generalize well within and near the dataset distribution, rather than ‘over-fitting’ to training samples. To meet this requirement, we propose a new approach called Conservative In-Distribution Q-learning (CIDQL) that takes a step towards in-distribution offline RL. CIDQL is designed to learn in-distribution with respect to the dataset, using a perturbation-based interpolation technique and a quantile method for value regularization. It prohibits bootstrapping during value iteration, ensuring stable Q-value learning that is separated from policy improvement. The approach has theoretical guarantees for both Q-value underestimation and non-underestimation, and outperforms most SOTA algorithms on D4RL gym-MuJoCo benchmarks.
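The abstract names two ingredients, perturbation around dataset samples and a quantile-based regularizer, without giving their exact form. The sketch below is not CIDQL's actual objective; it only illustrates the generic building blocks such a method could rest on: a standard quantile (pinball) loss, whose minimizer shifts toward lower values as the quantile level `tau` decreases, and a small Gaussian perturbation of a dataset action. The function names, the noise scale, and the assumed action range of [-1, 1] are all hypothetical choices made for illustration.

```python
import random


def pinball_loss(pred, targets, tau):
    """Mean quantile (pinball) loss of a scalar prediction against samples.

    For tau < 0.5 over-predictions are penalized more heavily than
    under-predictions, which pushes the minimizer toward a lower quantile.
    """
    total = 0.0
    for y in targets:
        diff = y - pred
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(targets)


def perturb_action(action, noise_scale, rng):
    """Add Gaussian noise to each action dimension, clipped to [-1, 1].

    The [-1, 1] range is an assumption (common for MuJoCo-style tasks).
    """
    return [max(-1.0, min(1.0, a + rng.gauss(0.0, noise_scale))) for a in action]


if __name__ == "__main__":
    value_samples = [0.0, 1.0, 2.0, 3.0, 4.0]
    # With a low quantile level, a low prediction incurs a smaller loss.
    print(pinball_loss(0.0, value_samples, tau=0.1))
    print(pinball_loss(2.0, value_samples, tau=0.1))

    rng = random.Random(0)
    print(perturb_action([0.5, -0.9], noise_scale=0.05, rng=rng))
```

With `tau = 0.1`, predicting 0 scores a lower loss than predicting the median 2, which is the sense in which a low quantile target yields a conservative value estimate.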