Boosting Offline Reinforcement Learning with Action Preference Query

Qisen Yang, Shenzhi Wang, Matthieu Gaetan Lin, Shiji Song, Gao Huang
2023-06-06
Abstract: Training practical agents usually involves offline and online reinforcement learning (RL) to balance the policy's performance and interaction costs. In particular, online fine-tuning has become a commonly used method to correct the erroneous estimates of out-of-distribution data learned in the offline training phase. However, even limited online interactions can be inaccessible or catastrophic for high-stakes scenarios like healthcare and autonomous driving. In this work, we introduce an interaction-free training scheme dubbed Offline-with-Action-Preferences (OAP). The main insight is that, compared to online fine-tuning, querying the preferences between pre-collected and learned actions can be equally or even more helpful for the erroneous estimate problem. By adaptively encouraging or suppressing the policy constraint according to action preferences, OAP can distinguish overestimation from beneficial policy improvement and thus attain a more accurate evaluation of unseen data. Theoretically, we prove a lower bound on the behavior policy's performance improvement brought by OAP. Moreover, comprehensive experiments on the D4RL benchmark and state-of-the-art algorithms demonstrate that OAP yields higher scores (29% on average), especially on challenging AntMaze tasks (98% higher).
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how to improve offline reinforcement learning (offline RL) without any online interaction in high-risk scenarios such as healthcare and autonomous driving. Conventional offline-to-online schemes improve performance through limited online fine-tuning, but even a small amount of online interaction can have catastrophic consequences in these settings. The paper therefore proposes a training scheme named Offline-with-Action-Preferences (OAP), which queries action preferences and uses them to adaptively adjust the policy constraint, improving the offline policy without interacting with the environment.

OAP proceeds in three steps:

1. **Action-preference query**: Select a subset of samples from the offline dataset and query a black-box preference model for the action preference on each of them, that is, whether the pre-collected action or the learned action is preferred. This step requires no interaction with the environment.
2. **Pseudo-query and RankNet**: Train a RankNet model on the queried data, then use it to run pseudo-queries on the remaining unqueried data and obtain additional preference information.
3. **Adjust policy constraints**: Adjust the direction of policy optimization according to the action preferences. If the action generated by the current policy is preferred over the action in the dataset, relax the constraint; otherwise, strengthen the constraint to limit the negative impact of distribution shift.

In this way, OAP improves the learning of offline policies while remaining interaction-free and therefore safe. On challenging tasks such as AntMaze and Adroit, OAP's scores are reported to exceed those of the compared methods by a wide margin. The theoretical analysis establishes a lower bound on the performance improvement OAP achieves over the behavior policy, and indicates that OAP can still improve performance when the preference labels contain errors.
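The steps described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the pairwise logistic loss is the standard RankNet-style objective, but the function names, the `base` and `delta` parameters, the multiplicative weighting rule, and the squared-distance constraint term are all assumptions introduced here only to show how a preference label could relax or strengthen a policy constraint.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def ranknet_loss(score_policy: float, score_data: float, label: float) -> float:
    """Standard RankNet-style pairwise logistic loss.

    label = 1.0 means the policy action is preferred over the dataset
    action; label = 0.0 means the dataset action is preferred.
    """
    p = sigmoid(score_policy - score_data)  # modelled P(policy action preferred)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def constraint_weight(policy_preferred: bool, base: float = 1.0, delta: float = 0.5) -> float:
    """Hypothetical rule: relax the constraint when the policy action is
    preferred, strengthen it otherwise. `base` and `delta` are illustrative."""
    return base * (1.0 - delta) if policy_preferred else base * (1.0 + delta)


def constrained_policy_loss(q_value, policy_action, data_action, policy_preferred,
                            base=1.0, delta=0.5):
    """Assumed objective: maximize Q while penalizing the squared distance
    to the dataset action, scaled by the preference-dependent weight."""
    bc_term = sum((p - d) ** 2 for p, d in zip(policy_action, data_action))
    return -q_value + constraint_weight(policy_preferred, base, delta) * bc_term


# Same state and actions, two different preference labels.
relaxed = constrained_policy_loss(1.0, [0.5, 0.0], [0.0, 0.0], policy_preferred=True)
strict = constrained_policy_loss(1.0, [0.5, 0.0], [0.0, 0.0], policy_preferred=False)
print(relaxed, strict)
```

In this sketch, deviating from the dataset action costs less when the preference label favors the policy action and more when it favors the dataset action, which is the qualitative behavior step 3 describes. In a fuller version, a ranking model trained with `ranknet_loss` on the queried pairs would supply the `policy_preferred` label for unqueried samples.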