Abstract: Offline reinforcement learning learns from a static dataset without interacting with the environment, which ensures safety and therefore gives it good application prospects. However, directly applying naive reinforcement learning methods usually fails in the offline setting because of function approximation errors caused by out-of-distribution (OOD) actions. To address this problem, existing algorithms mainly penalize the Q-values of OOD actions, and the quality of these constraints matters: imprecise constraints may lead to suboptimal solutions, while precise constraints incur significant computational cost. In this paper, we propose a novel count-based method for continuous domains, called the Grid-Mapping Pseudo-Count method (GPC), to penalize Q-values appropriately while reducing computational cost. The proposed method maps the state and action spaces to a discrete space and constrains their Q-values through pseudo-counts. We prove theoretically that only a few conditions are needed to obtain accurate uncertainty constraints with the proposed method. Moreover, we develop the Grid-Mapping Pseudo-Count Soft Actor-Critic (GPC-SAC) algorithm, which uses GPC under the Soft Actor-Critic (SAC) framework, to demonstrate the effectiveness of GPC. Experimental results on D4RL benchmark datasets show that GPC-SAC achieves better performance at lower computational cost than other algorithms.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve the out-of-distribution (OOD) problem in offline reinforcement learning (offline RL). When traditional reinforcement learning methods are applied directly in the offline setting, actions or states that the static dataset does not cover lead to inaccurate Q-value estimates, which in turn degrade the performance of the policy.

In offline RL, the Q-function approximator can only make reliable estimates for data in the static dataset. For OOD state-action pairs it may give inaccurate Q-value estimates, so the policy may select suboptimal actions that merely appear good under those estimates, resulting in performance degradation. Existing algorithms mainly address this by penalizing the Q-values of OOD actions, but they can suffer from imprecise constraints or high computational cost.

To solve these problems, the authors propose a new count-based method for offline RL in continuous domains: the Grid-Mapping Pseudo-Count method (GPC). It maps the state and action spaces to discrete spaces and constrains the Q-values through pseudo-counts, thereby reducing computational cost and improving performance.

### Specific improvement points

1. **Reduce computational costs**: Compared with approaches that use complex models such as autoencoders, GPC simplifies the computation through grid mapping and avoids additional network computation.
2. **More precise uncertainty constraints**: GPC can obtain accurate uncertainty constraints under fewer assumptions, thus avoiding the suboptimal solutions caused by imprecise constraints.
3. **Theoretical proof**: It is proven theoretically that GPC needs only a few conditions to obtain accurate uncertainty constraints.
4. **Experimental verification**: Experimental results on the D4RL benchmark datasets show that GPC-SAC has better performance and lower computational cost than other algorithms.

### Conclusion

This paper proposes a grid-mapping-based pseudo-count method (GPC) for the OOD problem in offline reinforcement learning. By mapping the state and action spaces to discrete spaces and using pseudo-counts to constrain Q-values, GPC reduces computational cost and improves performance. Experimental results show that GPC-SAC outperforms existing methods on multiple tasks.
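The grid-mapping idea summarized above can be illustrated with a short Python sketch. This is not the authors' implementation: the class name `GridPseudoCount`, the uniform grid with `bins` cells per dimension, and the penalty form `beta / sqrt(n + 1)` are all assumptions chosen for illustration, since the summary does not state the exact discretization or penalty used in the paper. The sketch only shows how a continuous state-action pair might be mapped to a grid cell, counted, and turned into a Q-value penalty that is larger for rarely seen cells.

```python
import math
from collections import Counter


class GridPseudoCount:
    """Hypothetical helper: counts dataset visits to grid cells of (state, action) space."""

    def __init__(self, low, high, bins):
        # low/high: per-dimension bounds of the concatenated (state, action) vector.
        # bins: number of grid cells per dimension (an assumed hyperparameter).
        if len(low) != len(high):
            raise ValueError("low and high must have the same length")
        self.low = list(low)
        self.high = list(high)
        self.bins = bins
        self.counts = Counter()

    def cell(self, state, action):
        """Map a continuous (state, action) pair to a tuple of integer grid indices."""
        x = list(state) + list(action)
        idx = []
        for v, lo, hi in zip(x, self.low, self.high):
            i = int((v - lo) / (hi - lo) * self.bins)
            # Clamp so that values on or outside the bounds fall into the edge cells.
            idx.append(min(max(i, 0), self.bins - 1))
        return tuple(idx)

    def update(self, state, action):
        """Record one dataset transition in its grid cell."""
        self.counts[self.cell(state, action)] += 1

    def count(self, state, action):
        """Pseudo-count: number of dataset samples sharing this pair's grid cell."""
        return self.counts[self.cell(state, action)]

    def penalty(self, state, action, beta=1.0):
        """Assumed count-based uncertainty: beta / sqrt(n + 1)."""
        return beta / math.sqrt(self.count(state, action) + 1)


# Toy 1-D state, 1-D action dataset (made-up values for illustration only).
gpc = GridPseudoCount(low=[-1.0, -1.0], high=[1.0, 1.0], bins=4)
for s, a in [([0.1], [0.2]), ([0.15], [0.3]), ([-0.9], [0.9])]:
    gpc.update(s, a)

q_estimate = 5.0  # stand-in for a critic output
q_in_dist = q_estimate - gpc.penalty([0.12], [0.25])  # cell seen in the dataset
q_ood = q_estimate - gpc.penalty([0.9], [-0.9])       # cell never seen
print(q_in_dist > q_ood)
```

Because the counts live in a dictionary keyed by cell index, querying the penalty needs no extra network forward pass, which is consistent with the computational-cost argument in the summary. A uniform grid grows exponentially with dimension, so a real implementation would need to choose the discretization with care.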

Grid-Mapping Pseudo-Count Constraint for Offline Reinforcement Learning

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

UAC: Offline Reinforcement Learning with Uncertain Action Constraint

Constraints Penalized Q-learning for Safe Offline Reinforcement Learning

Constrained Policy Optimization with Explicit Behavior Density for Offline Reinforcement Learning

Adaptable Conservative Q-Learning for Offline Reinforcement Learning

Robust Offline Reinforcement Learning with Gradient Penalty and Constraint Relaxation

Counterfactual Conservative Q Learning for Offline Multi-agent Reinforcement Learning

Robust Offline Reinforcement Learning from Low-Quality Data

Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning

Offline Goal-Conditioned Reinforcement Learning for Safety-Critical Tasks with Recovery Policy

In-sample Actor Critic for Offline Reinforcement Learning

Offline Reinforcement Learning with OOD State Correction and OOD Action Suppression

OGBench: Benchmarking Offline Goal-Conditioned RL

State Deviation Correction for Offline Reinforcement Learning

Mildly Conservative Q-Learning for Offline Reinforcement Learning

Strategically Conservative Q-Learning

Q-Distribution guided Q-learning for offline reinforcement learning: Uncertainty penalized Q-value via consistency model

Offline Reinforcement Learning With Behavior Value Regularization

Boosting Offline Reinforcement Learning with Action Preference Query

Robot Crowd Navigation in Dynamic Environment with Offline Reinforcement Learning