Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift

Riashat Islam, Komal K. Teru, Deepak Sharma, Joelle Pineau
DOI: https://doi.org/10.48550/arXiv.1911.06970
2019-12-01
Abstract: Off-policy deep reinforcement learning (RL) algorithms are incapable of learning solely from batch offline data without online interactions with the environment, due to the phenomenon known as extrapolation error. This is often due to past data available in the replay buffer that may be quite different from the data distribution under the current policy. We argue that most off-policy learning methods fundamentally suffer from a state distribution shift due to the mismatch between the state visitation distribution of the data collected by the behavior and target policies. This data distribution shift between current and past samples can significantly impact the performance of most modern off-policy based policy optimization algorithms. In this work, we first do a systematic analysis of state distribution mismatch in off-policy learning, and then develop a novel off-policy policy optimization method to constrain the state distribution shift. To do this, we first estimate the state distribution based on features of the state, using a density estimator, and then develop a novel constrained off-policy gradient objective that minimizes the state distribution shift. Our experimental results on continuous control tasks show that minimizing this distribution mismatch can significantly improve performance in most popular practical off-policy policy gradient algorithms.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is that, in off-policy reinforcement learning, the ability to learn from batch offline data is limited by state distribution shift. Specifically, because past data does not match the data distribution under the current policy, most off-policy learning methods cannot learn effectively from offline data alone. This mismatch in state distribution leads to extrapolation error, which in turn degrades the performance of the algorithm.

### Background of the Paper

Off-policy deep reinforcement learning (DRL) algorithms usually need to interact with the environment online to collect data, because relying solely on batch offline data leads to performance degradation. The offline data may differ from the data distribution under the current policy, which causes state distribution shift. This shift produces extrapolation error when the algorithm tries to learn from offline data; that is, the algorithm performs poorly on unseen states.

### Main Contributions

1. **Analyzing the Impact of State Distribution Shift**:
   - The authors first systematically analyze the impact of state distribution shift on off-policy learning. Through experiments, they find that as the difference between the offline data and the data distribution under the current policy increases, the performance of the algorithm decreases significantly.
2. **Proposing a New Constraint Method**:
   - To alleviate state distribution shift, the authors propose a new constraint method that improves off-policy learning by minimizing the state distribution difference between the behavior policy and the target policy. Specifically, they use density estimators (such as variational autoencoders, VAEs) to estimate the state distribution and use the KL divergence (Kullback-Leibler divergence) to constrain the state distribution shift.
3. **Experimental Verification**:
   - The authors run experiments on multiple continuous control tasks and verify that their method can significantly improve the performance of existing off-policy algorithms (such as DDPG, TD3, and SAC). The experimental results show that minimizing the state distribution shift improves performance without requiring additional samples.

### Method Overview

1. **State Distribution Estimation**:
   - Use density estimators (such as a VAE) to estimate the state distributions \( d_\pi(s) \) and \( d_\mu(s) \) based on state features, where \( d_\pi(s) \) is the state distribution under the target policy and \( d_\mu(s) \) is the state distribution under the behavior policy.
2. **Constraining State Distribution Shift**:
   - Introduce a KL divergence term in the policy gradient update as a regularization term to minimize the state distribution shift. The specific objective function is:
     \[ J(\pi_\theta) = \mathbb{E} \left[ \sum_{t = 0}^{\infty} \gamma^t r_t(s, a) - \text{KL}(d_\mu(s_\mu) \| d_{\pi_\theta+\epsilon}(s_\pi)) \right] \]
   - where \( \text{KL}(d_\mu(s_\mu) \| d_{\pi_\theta+\epsilon}(s_\pi)) \) is the KL divergence between the state distributions of the behavior policy and the target policy.
3. **Experimental Setup**:
   - The authors run experiments on multiple MuJoCo benchmark tasks and compare the performance of their method with that of baseline algorithms (such as DDPG, TD3, and SAC). The results show that minimizing the state distribution shift significantly improves performance in multiple environments.

### Conclusion

By introducing a constraint on state distribution shift, this paper provides an effective method for improving the performance of off-policy reinforcement learning algorithms. The method performs well in standard off-policy learning tasks and also shows advantages in offline learning tasks with fixed-batch data. This research provides new ideas and technical means for the field of off-policy reinforcement learning.
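
To make the objective in the Method Overview concrete, the following is a minimal sketch of how a KL penalty on the state distribution could be evaluated. It is not the authors' implementation. It swaps the VAE density estimator for a much simpler diagonal Gaussian fit, so that the KL divergence has a closed form, and it only evaluates the scalar objective rather than performing a policy gradient update. The function names and the weight `lam` are hypothetical; `lam=1.0` corresponds to the unweighted KL term in the formula above.

```python
import numpy as np


def fit_diag_gaussian(states, eps=1e-6):
    """Fit a diagonal Gaussian to a batch of states (rows are samples).

    Assumption: a stand-in for the VAE density estimator named in the summary.
    """
    states = np.asarray(states, dtype=float)
    mean = states.mean(axis=0)
    var = states.var(axis=0) + eps  # eps keeps every variance strictly positive
    return mean, var


def kl_diag_gaussians(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(p || q) between two diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0
    )


def regularized_objective(rewards, behavior_states, target_states,
                          gamma=0.99, lam=1.0):
    """Discounted return of one trajectory minus lam * KL(d_mu || d_pi).

    `lam` is a hypothetical penalty weight, not a parameter from the source.
    """
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    discounted_return = float(np.sum(discounts * rewards))

    mean_mu, var_mu = fit_diag_gaussian(behavior_states)  # estimate of d_mu
    mean_pi, var_pi = fit_diag_gaussian(target_states)    # estimate of d_pi
    kl = float(kl_diag_gaussians(mean_mu, var_mu, mean_pi, var_pi))
    return discounted_return - lam * kl, kl
```

When the behavior and target state batches have the same mean and variance, the penalty vanishes and the objective reduces to the ordinary discounted return. The further the target policy's states drift from the behavior data, the larger the penalty becomes.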