Abstract: Safe offline reinforcement learning aims to learn policies that maximize cumulative rewards while adhering to safety constraints, using only offline data for training. A key challenge is balancing safety and performance, particularly when the policy encounters out-of-distribution (OOD) states and actions, which can lead to safety violations or overly conservative behavior during deployment. To address these challenges, we introduce Feasibility Informed Advantage Weighted Actor-Critic (FAWAC), a method that prioritizes persistent safety in constrained Markov decision processes (CMDPs). FAWAC formulates policy optimization with feasibility conditions derived specifically for offline datasets, enabling safe policy updates in non-parametric policy space, followed by projection into parametric space for constrained actor training. By incorporating a cost-advantage term into Advantage Weighted Regression (AWR), FAWAC ensures that the safety constraints are respected while maximizing performance. Additionally, we propose a strategy to address a more challenging class of problems that involves tempting datasets where trajectories are predominantly high-rewarded but unsafe. Empirical evaluations on standard benchmarks demonstrate that FAWAC achieves strong results, effectively balancing safety and performance in learning policies from static datasets.
### What problem does this paper attempt to solve?
This paper addresses safety in **Offline Reinforcement Learning (Offline RL)**, where a policy is trained only on a static dataset. Specifically, the authors focus on ensuring that the policy satisfies safety constraints while maximizing cumulative reward, particularly when it encounters out-of-distribution (OOD) states and actions, so that it neither violates the constraints nor becomes overly conservative.
#### Main challenges
1. **Balancing safety and performance**: In offline RL, the policy must not violate safety constraints while maximizing the cumulative reward. This is especially difficult because the offline dataset may contain high-reward but unsafe trajectories.
2. **Handling out-of-distribution (OOD) states and actions**: States and actions not seen in the offline dataset may lead to safety violations or overly conservative policies.
3. **Dealing with tempting datasets**: These datasets consist mainly of high-reward but unsafe trajectories, which makes learning a safe policy harder.
#### Solutions
To solve the above problems, the authors propose the **Feasibility Informed Advantage Weighted Actor-Critic (FAWAC)** method. FAWAC formulates policy optimization with feasibility conditions derived for offline datasets, performs safe policy updates in the non-parametric policy space, and then projects the result into the parametric space for constrained actor training. Specific improvements include:
- **Introducing a cost-advantage term**: A cost-advantage term is added to Advantage Weighted Regression (AWR) so that the safety constraints are respected while performance is maximized.
- **Handling tempting datasets**: A dedicated strategy is proposed for datasets that consist mainly of high-reward but unsafe trajectories, so that the learned policy can remain safe without giving up performance.
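The cost-advantage idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the clipping value `max_weight`, and the toy numbers are assumptions made for illustration. The only element taken from the summary is the form of the weight, `exp((A_r - nu * A_c) / lam)`, where `A_r` is the reward advantage, `A_c` the cost advantage, and `nu`, `lam` the multipliers.

```python
import numpy as np


def cost_aware_awr_weights(adv_r, adv_c, nu=1.0, lam=1.0, max_weight=100.0):
    """Per-sample weights exp((A_r - nu * A_c) / lam).

    max_weight is a hypothetical clipping value added for numerical stability.
    """
    adv_r = np.asarray(adv_r, dtype=np.float64)
    adv_c = np.asarray(adv_c, dtype=np.float64)
    weights = np.exp((adv_r - nu * adv_c) / lam)
    return np.minimum(weights, max_weight)


def weighted_nll(log_probs, weights):
    """Weighted negative log-likelihood of the dataset actions (actor loss)."""
    log_probs = np.asarray(log_probs, dtype=np.float64)
    return -np.mean(weights * log_probs)


# Two dataset actions with the same reward advantage; the second is costly.
adv_r = [1.0, 1.0]
adv_c = [0.0, 2.0]
weights = cost_aware_awr_weights(adv_r, adv_c, nu=1.0, lam=1.0)
loss = weighted_nll(log_probs=[-0.5, -0.5], weights=weights)
```

With equal reward advantages, the action with the larger cost advantage receives the smaller weight, so the actor is pushed to imitate it less. In practice the log-probabilities would come from the parametric policy being trained.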
#### Experimental verification
The authors evaluate FAWAC on standard benchmarks and report strong results, with policies learned from static datasets that balance safety and performance.
### Formula summary
- **CMDP definition**:
\[
M=(S, A, P, r, c, \gamma, \rho_0)
\]
where \(S\) and \(A\) represent the state space and action space respectively, \(P(s'|s, a)\) is the transition probability function, \(r(s, a)\) is the reward function, \(c(s, a)\) is the cost function, \(\gamma\in[0, 1)\) is the discount factor, and \(\rho_0\) is the initial state distribution.
- **Optimization objective**:
\[
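\max_{\pi} V^{\pi}_r(s)\quad\text{s.t.}\quad V^{\pi}_c(s)\leq\kappa,\quad D_{KL}(\pi||\pi_{\beta})\leq\delta
\]
where \(V^{\pi}_r\) and \(V^{\pi}_c\) are the reward and cost value functions of policy \(\pi\), \(\kappa\) is the cost threshold, \(\pi_{\beta}\) is the behavior policy, and \(\delta\) is the tolerance on the KL divergence.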
- **Optimal policy form**:
\[
\pi^*(a|s)=\frac{1}{Z(s)}\pi_{\beta}(a|s)\exp\left(\frac{A^{\pi_k}(s, a)-\nu A^{\pi_k}_c(s, a)}{\lambda}\right)
\]
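where \(Z(s)\) is the normalization factor, \(A^{\pi_k}\) and \(A^{\pi_k}_c\) are the reward and cost advantages under the current policy \(\pi_k\), and \(\nu\) and \(\lambda\) are Lagrange multipliers.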
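The closed form above is consistent with a standard Lagrangian argument. The following is a sketch under assumptions that this summary does not state: the per-state surrogate problem uses the advantages of the current policy \(\pi_k\) in place of the value functions, \(\epsilon_c\) is a hypothetical per-state cost budget, and \(\alpha\) is an auxiliary multiplier that enforces normalization of \(\pi\).
\[
\mathcal{L}(\pi,\nu,\lambda,\alpha)=\mathbb{E}_{a\sim\pi}\left[A^{\pi_k}(s, a)-\nu A^{\pi_k}_c(s, a)\right]+\nu\epsilon_c-\lambda\left(D_{KL}(\pi||\pi_{\beta})-\delta\right)+\alpha\left(1-\int_a\pi(a|s)\,da\right)
\]
Setting the derivative with respect to \(\pi(a|s)\) to zero gives
\[
A^{\pi_k}(s, a)-\nu A^{\pi_k}_c(s, a)-\lambda\left(\log\frac{\pi(a|s)}{\pi_{\beta}(a|s)}+1\right)-\alpha=0
\]
and solving for \(\pi(a|s)\) yields
\[
\pi(a|s)=\pi_{\beta}(a|s)\exp\left(\frac{A^{\pi_k}(s, a)-\nu A^{\pi_k}_c(s, a)}{\lambda}\right)\exp\left(-1-\frac{\alpha}{\lambda}\right)
\]
The last factor does not depend on \(a\), so it is absorbed into \(1/Z(s)\). Under this reading, \(\nu\) is the multiplier on the cost constraint and \(\lambda\) the multiplier on the KL constraint.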
Together, the feasibility conditions and the cost-aware advantage weighting are the mechanisms FAWAC relies on to pursue persistent safety in offline reinforcement learning.