RICE: Breaking Through the Training Bottlenecks of Reinforcement Learning with Explanation

Zelei Cheng, Xian Wu, Jiahao Yu, Sabrina Yang, Gang Wang, Xinyu Xing
2024-06-06
Abstract: Deep reinforcement learning (DRL) is playing an increasingly important role in real-world applications. However, obtaining an optimally performing DRL agent for complex tasks, especially with sparse rewards, remains a significant challenge. The training of a DRL agent can often be trapped in a bottleneck without further progress. In this paper, we propose RICE, an innovative refining scheme for reinforcement learning that incorporates explanation methods to break through the training bottlenecks. The high-level idea of RICE is to construct a new initial state distribution that combines both the default initial states and critical states identified through explanation methods, thereby encouraging the agent to explore from the mixed initial states. Through careful design, we can theoretically guarantee that our refining scheme has a tighter sub-optimality bound. We evaluate RICE in various popular RL environments and real-world applications. The results demonstrate that RICE significantly outperforms existing refining schemes in enhancing agent performance.
Machine Learning, Artificial Intelligence, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses the training bottlenecks that deep reinforcement learning (DRL) encounters in complex tasks, particularly in sparse-reward environments. Specifically:

1. **Training Bottleneck Problem**: When trained on complex tasks, DRL agents often get stuck at a bottleneck and stop improving. This shows up as poor performance in certain critical states or a failure to reach the final goal.
2. **Challenges in Sparse-Reward Environments**: In sparse-reward environments, agents receive too little feedback to guide learning, which results in low training efficiency and poor performance.

To solve these problems, the authors propose RICE, a refining scheme for reinforcement learning with explanation. The main idea of RICE is to identify critical states through explanation methods and to construct a new initial-state distribution that combines the default initial states with these critical states, thereby encouraging the agent to start exploring from a mixed set of initial states.

### Core Contributions of RICE

1. **Breaking Through the Training Bottleneck**: By combining explanation methods with a mixed initial-state distribution, RICE helps the agent escape local optima and improves overall performance.
2. **Theoretical Guarantee**: The design of RICE ensures a tighter sub-optimality bound, meaning that, in theory, the refined policy can come closer to the optimal policy.
3. **Improved Explanation Method**: The authors improve the existing StateMask method, raising its training efficiency while maintaining explanation accuracy.
4. **Extensive Experimental Verification**: The authors evaluate RICE in multiple simulated games and real-world applications, and the results show that RICE significantly outperforms existing refinement methods.

### Formula Representation

The formulas involved in the paper include, but are not limited to, the following:

- **Value Function and Q-Function**:
  \[ V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\mid s_{0}=s\right] \]
  \[ Q^{\pi}(s,a)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\mid s_{0}=s,a_{0}=a\right] \]
- **Advantage Function**:
  \[ A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s) \]
- **State Occupancy Distribution**:
  \[ d_{\rho}^{\pi}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr_{\pi}(s_{t}=s\mid s_{0}\sim\rho) \]
- **Sub-Optimality Bound**:
  \[ \text{SubOpt}:=V^{\pi^{*}}(\rho)-V^{\pi'}(\rho) \]

Using these quantities, the authors argue, both theoretically and empirically, that RICE addresses the bottleneck problem in DRL training.
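
The mixed initial-state idea summarized above can be illustrated with a small sketch. This is not the authors' implementation: the function name `sample_mixed_initial_state`, the mixing probability `p_critical`, and the representation of critical states as a plain list are all hypothetical, and the sketch assumes the environment can be reset to an arbitrary stored state. How critical states are actually selected (via the explanation method) is outside its scope.

```python
import random
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")


def sample_mixed_initial_state(
    default_reset: Callable[[], S],
    critical_states: Sequence[S],
    p_critical: float,
    rng: random.Random,
) -> S:
    """Draw an initial state from a two-component mixture.

    With probability `p_critical` (a hypothetical knob, not a value from
    the paper) the episode starts from a previously identified critical
    state; otherwise it starts from the environment's default initial
    state distribution, represented here by `default_reset`.
    """
    if not 0.0 <= p_critical <= 1.0:
        raise ValueError("p_critical must lie in [0, 1]")
    # Fall back to the default distribution when no critical states exist yet.
    if critical_states and rng.random() < p_critical:
        return rng.choice(critical_states)
    return default_reset()


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins: the default start is the string "s0"; the critical
    # states are labels that an explanation method would have produced.
    starts = [
        sample_mixed_initial_state(lambda: "s0", ["c1", "c2"], 0.5, rng)
        for _ in range(10)
    ]
    print(starts)
```

Setting `p_critical` to 0 recovers ordinary training from the default initial states, while values closer to 1 concentrate exploration around the critical states; the balance that works in practice is an empirical question this sketch does not answer.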