Abstract: Safe offline reinforcement learning aims to learn policies that maximize cumulative rewards while adhering to safety constraints, using only offline data for training. A key challenge is balancing safety and performance, particularly when the policy encounters out-of-distribution (OOD) states and actions, which can lead to safety violations or overly conservative behavior during deployment. To address these challenges, we introduce Feasibility Informed Advantage Weighted Actor-Critic (FAWAC), a method that prioritizes persistent safety in constrained Markov decision processes (CMDPs). FAWAC formulates policy optimization with feasibility conditions derived specifically for offline datasets, enabling safe policy updates in non-parametric policy space, followed by projection into parametric space for constrained actor training. By incorporating a cost-advantage term into Advantage Weighted Regression (AWR), FAWAC ensures that the safety constraints are respected while maximizing performance. Additionally, we propose a strategy to address a more challenging class of problems that involves tempting datasets, where trajectories are predominantly high-reward but unsafe. Empirical evaluations on standard benchmarks demonstrate that FAWAC achieves strong results, effectively balancing safety and performance when learning policies from static datasets.

### What problem does this paper attempt to solve?

This paper addresses safety in **Offline Reinforcement Learning (Offline RL)**, where the policy is trained only on a static dataset. Specifically, the authors focus on how to keep the policy within safety constraints while maximizing cumulative reward, particularly when it encounters out-of-distribution (OOD) states and actions, so that it avoids both safety violations and overly conservative behavior.

#### Main challenges

1. **Balancing safety and performance**: In offline RL, the policy must not violate safety constraints while maximizing cumulative reward. This is especially difficult because the offline dataset may contain high-reward but unsafe trajectories.
2. **Handling out-of-distribution (OOD) states and actions**: States and actions not seen in the offline dataset may lead to safety violations or overly conservative policies.
3. **Dealing with tempting datasets**: These datasets consist mainly of high-reward but unsafe trajectories, which makes learning a safe policy harder.

#### Solutions

To address these problems, the authors propose the **Feasibility Informed Advantage Weighted Actor-Critic (FAWAC)** method. The core idea of FAWAC is to optimize the policy under feasibility conditions, which ensures safe updates in the non-parametric policy space, and then to project the result into the parametric space for constrained actor training. Specific improvements include:

- **Introducing a cost-advantage term**: A cost-advantage term is added to Advantage Weighted Regression (AWR) so that safety constraints are respected.
- **Handling tempting datasets**: A strategy is proposed for datasets consisting mainly of high-reward but unsafe trajectories, with the aim of maintaining high performance while still satisfying the safety constraints.
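
The cost-advantage idea can be illustrated with a minimal NumPy sketch of cost-aware AWR weights. It is not the authors' implementation: the function names, the fixed values of `nu` and `lam`, and the `max_weight` cap are assumptions made for illustration, and the advantages are assumed to be precomputed by separate reward and cost critics.

```python
import numpy as np


def cost_aware_awr_weights(adv_reward, adv_cost, nu=1.0, lam=1.0, max_weight=100.0):
    """Per-sample weights exp((A_r - nu * A_c) / lam), capped at max_weight.

    nu, lam, and max_weight are illustrative fixed values (assumptions).
    """
    adv_reward = np.asarray(adv_reward, dtype=float)
    adv_cost = np.asarray(adv_cost, dtype=float)
    exponent = (adv_reward - nu * adv_cost) / lam
    # Cap the exponent before exponentiating so large advantages cannot overflow.
    return np.exp(np.minimum(exponent, np.log(max_weight)))


def weighted_actor_loss(log_probs, weights):
    """Weighted negative log-likelihood of dataset actions under the actor."""
    log_probs = np.asarray(log_probs, dtype=float)
    return float(-np.mean(weights * log_probs))


# Toy batch: the second action has the same reward advantage as the first
# but a large cost advantage, so it receives the smallest weight.
adv_r = np.array([1.0, 1.0, -0.5])
adv_c = np.array([0.0, 2.0, 0.0])
weights = cost_aware_awr_weights(adv_r, adv_c, nu=1.0, lam=1.0)
loss = weighted_actor_loss(np.log([0.5, 0.5, 0.5]), weights)
```

In this toy batch, the costly action is down-weighted below even the action with a negative reward advantage, which is the qualitative effect the cost-advantage term is meant to have.
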
#### Experimental verification

The authors evaluate FAWAC on standard benchmark tasks. The results show that it balances safety and performance well when learning under safety constraints from static datasets.

### Formula summary

- **CMDP definition**:
  \[ M = (S, A, P, r, c, \gamma, \rho_0) \]
  where \(S\) and \(A\) are the state space and action space respectively, \(P(s'|s, a)\) is the transition probability function, \(r(s, a)\) is the reward function, \(c(s, a)\) is the cost function, \(\gamma \in [0, 1)\) is the discount factor, and \(\rho_0\) is the initial state distribution.
- **Optimization objective**:
  \[ \max_{\pi} V^{\pi}_r(s) \quad \text{s.t.} \quad V^{\pi}_c(s) \leq \kappa, \quad D_{KL}(\pi \| \pi_{\beta}) \leq \delta \]
  where \(V^{\pi}_r\) and \(V^{\pi}_c\) are the reward and cost value functions, \(\kappa\) is the cost threshold, \(\pi_{\beta}\) is the behavior policy, and \(\delta\) is the tolerance on the KL divergence.
- **Optimal policy form**:
  \[ \pi^*(a|s) = \frac{1}{Z(s)} \pi_{\beta}(a|s) \exp\left(\frac{A^{\pi_k}(s, a) - \nu A^{\pi_k}_c(s, a)}{\lambda}\right) \]
  where \(A^{\pi_k}\) and \(A^{\pi_k}_c\) are the reward and cost advantages under the current policy \(\pi_k\), \(Z(s)\) is the normalization factor, and \(\nu\) and \(\lambda\) are Lagrange multipliers.

Together, these components are intended to give FAWAC persistent safety in offline reinforcement learning.
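
As a worked illustration of the optimal policy form, the sketch below evaluates it for a single state with three discrete actions. The function name `closed_form_policy` and all numbers are made up for this example, and the multipliers `nu` and `lam` are treated as fixed constants rather than being optimized, which is a simplifying assumption.

```python
import numpy as np


def closed_form_policy(behavior_probs, adv_reward, adv_cost, nu, lam):
    """pi*(a|s) proportional to pi_beta(a|s) * exp((A - nu * A_c) / lam), one state."""
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    exponent = (np.asarray(adv_reward, dtype=float)
                - nu * np.asarray(adv_cost, dtype=float)) / lam
    # Subtracting the max leaves the normalized result unchanged
    # and avoids overflow in exp.
    exponent = exponent - exponent.max()
    unnormalized = behavior_probs * np.exp(exponent)
    # Dividing by the sum plays the role of the normalization factor Z(s).
    return unnormalized / unnormalized.sum()


# Hypothetical numbers: action 2 has the highest reward advantage
# but also a large cost advantage.
behavior = np.array([0.5, 0.3, 0.2])
adv_r = np.array([0.0, 1.0, 2.0])
adv_c = np.array([0.0, 0.0, 3.0])

pi_reward_only = closed_form_policy(behavior, adv_r, adv_c, nu=0.0, lam=1.0)
pi_cost_aware = closed_form_policy(behavior, adv_r, adv_c, nu=1.0, lam=1.0)
```

With `nu=0.0` the form reduces to the usual reward-only AWR reweighting and the high-reward, high-cost action is preferred. With `nu=1.0` the probability mass shifts to the action that is rewarding without added cost.
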

FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning

Adversarially Trained Weighted Actor-Critic for Safe Offline Reinforcement Learning

FOSP: Fine-tuning Offline Safe Policy through World Models

Feasible Actor-Critic: Constrained Reinforcement Learning for Ensuring Statewise Safety

Robust Offline Reinforcement Learning from Low-Quality Data

LAPO: Latent-Variable Advantage-Weighted Policy Optimization for Offline Reinforcement Learning

Cost-aware Offline Safe Meta Reinforcement Learning with Robust In-Distribution Online Task Adaptation

Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model

Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning

A2PO: Towards Effective Offline Reinforcement Learning from an Advantage-aware Perspective

Offline Goal-Conditioned Reinforcement Learning for Safety-Critical Tasks with Recovery Policy

Offline Meta-Reinforcement Learning with Advantage Weighting

Goal-conditioned Offline Reinforcement Learning through State Space Partitioning

VOCE: Variational Optimization with Conservative Estimation for Offline Safe Reinforcement Learning

Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL

UAC: Offline Reinforcement Learning with Uncertain Action Constraint

Constraints Penalized Q-learning for Safe Offline Reinforcement Learning

Safe Reinforcement Learning with Dead-Ends Avoidance and Recovery

SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics

Safety-Aware Causal Representation for Trustworthy Offline Reinforcement Learning in Autonomous Driving