Grid-Mapping Pseudo-Count Constraint for Offline Reinforcement Learning

Yi Shen, Hanyan Huang, Shan Xie
2024-04-03
Abstract: Offline reinforcement learning learns from a static dataset without interacting with the environment, which ensures safety and makes it promising for practical applications. However, directly applying naive reinforcement learning methods usually fails in the offline setting because of function approximation errors caused by out-of-distribution (OOD) actions. To address this problem, existing algorithms mainly penalize the Q-values of OOD actions, and the quality of these constraints also matters: imprecise constraints may lead to suboptimal solutions, while precise constraints require significant computational cost. In this paper, we propose a novel count-based method for continuous domains, called the Grid-Mapping Pseudo-Count method (GPC), to penalize the Q-value appropriately while reducing the computational cost. The proposed method maps the state and action spaces to a discrete space and constrains the Q-values through the pseudo-count. We prove theoretically that the proposed method needs only a few conditions to obtain accurate uncertainty constraints. Moreover, we develop a Grid-Mapping Pseudo-Count Soft Actor-Critic (GPC-SAC) algorithm, which applies GPC within the Soft Actor-Critic (SAC) framework, to demonstrate the effectiveness of GPC. Experimental results on the D4RL benchmark datasets show that GPC-SAC achieves better performance at lower computational cost than other algorithms.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to solve the out-of-distribution (OOD) problem in offline reinforcement learning (offline RL). When traditional reinforcement learning methods are applied directly in the offline setting, actions or states that the static dataset does not cover lead to inaccurate Q-value estimates, which in turn degrade the performance of the policy.

In offline RL, the Q-function approximator can make reliable estimates only for data in the static dataset. For OOD state-action pairs it may give inaccurate Q-value estimates, so the policy may exploit suboptimal actions whose values are overestimated, resulting in performance degradation. Existing algorithms mainly address this by penalizing the Q-values of OOD actions, but their constraints may be imprecise or computationally expensive.

To solve these problems, the authors propose a new count-based method for offline RL in continuous domains, the Grid-Mapping Pseudo-Count method (GPC). The method maps the state and action spaces to discrete spaces and constrains the Q-values through pseudo-counts, thereby reducing computational cost and improving performance.

### Specific improvement points

1. **Reduce computational costs**: Compared with approaches that rely on complex models such as auto-encoders, GPC simplifies the computation through grid mapping and avoids additional network computation.
2. **More precise uncertainty constraints**: GPC can obtain accurate uncertainty constraints under fewer assumptions, thus avoiding suboptimal solutions caused by imprecise constraints.
3. **Theoretical proof**: It is theoretically proven that GPC needs only a few conditions to obtain accurate uncertainty constraints, which supports the effectiveness of the method.
4. **Experimental verification**: Experimental results on the D4RL benchmark datasets show that GPC-SAC achieves better performance and lower computational cost than other algorithms.

### Conclusion

This paper proposes a new grid-mapping-based pseudo-count method (GPC) for the OOD problem in offline reinforcement learning. By mapping the state and action spaces to discrete spaces and using pseudo-counts to constrain Q-values, GPC can effectively reduce computational cost and improve performance. Experimental results show that GPC-SAC outperforms existing methods on multiple tasks.
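The grid-mapping idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `GridPseudoCounter`, the uniform `bins_per_dim` grid, the known `low`/`high` bounds, and the penalty form `beta / sqrt(n + 1)` are all assumptions chosen for illustration. The summary does not state the paper's exact grid construction or penalty.

```python
import numpy as np
from collections import Counter


class GridPseudoCounter:
    """Hypothetical helper: counts dataset (state, action) pairs per grid cell."""

    def __init__(self, low, high, bins_per_dim):
        # low/high bound the concatenated (state, action) vector (assumed known).
        self.low = np.asarray(low, dtype=np.float64)
        self.high = np.asarray(high, dtype=np.float64)
        self.bins = int(bins_per_dim)
        self.counts = Counter()

    def _cells(self, states, actions):
        x = np.concatenate(
            [np.atleast_2d(states), np.atleast_2d(actions)], axis=1
        ).astype(np.float64)
        # Normalise each dimension to [0, 1], then scale to integer cell indices.
        unit = (x - self.low) / (self.high - self.low)
        idx = np.floor(unit * self.bins).astype(np.int64)
        # Clip so boundary and out-of-range points fall into the edge cells.
        return np.clip(idx, 0, self.bins - 1)

    def fit(self, states, actions):
        # One pass over the static dataset; no extra network is trained.
        for cell in self._cells(states, actions):
            self.counts[tuple(cell.tolist())] += 1
        return self

    def count(self, states, actions):
        cells = self._cells(states, actions)
        return np.array(
            [self.counts.get(tuple(c.tolist()), 0) for c in cells],
            dtype=np.float64,
        )

    def penalty(self, states, actions, beta=1.0):
        # Assumed penalty form: larger for rarely visited (likely OOD) cells.
        return beta / np.sqrt(self.count(states, actions) + 1.0)


# Toy dataset: 1-D state and 1-D action, both in [0, 1], 4 bins per dimension.
counter = GridPseudoCounter(low=[0.0, 0.0], high=[1.0, 1.0], bins_per_dim=4)
counter.fit(states=[[0.1], [0.15], [0.9]], actions=[[0.1], [0.2], [0.9]])

query_counts = counter.count(states=[[0.12], [0.5]], actions=[[0.05], [0.5]])
query_penalty = counter.penalty(states=[[0.12], [0.5]], actions=[[0.05], [0.5]])
```

In this sketch, a critic update could subtract `query_penalty` from the Q-target, so that pairs in unvisited cells receive the largest reduction. Whether GPC-SAC applies the penalty in exactly this way is not stated in the summary.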