Pavel Osinenko, Grigory Yaremenko, Georgiy Malaniya, Anton Bolychev, Alexander Gepperth
Abstract: Reinforcement learning is commonly concerned with problems of maximizing accumulated rewards in Markov decision processes. Oftentimes, a certain goal state or a subset of the state space attains maximal reward. In such a case, the environment may be considered solved when the goal is reached. Whereas numerous techniques, learning or non-learning based, exist for solving environments, doing so optimally is the biggest challenge. For example, one may choose a reward rate that penalizes the action effort. Reinforcement learning is currently among the most actively developed frameworks for solving environments optimally by virtue of maximizing accumulated reward, in other words, returns. Yet, tuning agents is a notoriously hard task, as reported in a series of works. Our aim here is to help the agent learn a near-optimal policy efficiently while ensuring the goal-reaching property of some basis policy that merely solves the environment. We suggest an algorithm that is fairly flexible and can be used to augment practically any agent as long as it comprises a critic. A formal proof of the goal-reaching property is provided. Comparative experiments on several problems under popular baseline agents provided empirical evidence that learning can indeed be boosted while ensuring the goal-reaching property.
What problem does this paper attempt to address?
The paper addresses how to design a reinforcement learning agent that is guaranteed to reach a target state while learning more efficiently. Specifically, it asks how the agent can learn a near-optimal policy while maximizing cumulative reward (the return) and, throughout learning, retain the target-reaching property of a base policy that merely solves the environment. The proposed algorithm can augment any agent that includes a critic, so that the agent gains learning efficiency without losing the ability to reach the target.
### Main Contributions
1. **Proposed Method**: The paper proposes a reinforcement learning method that speeds up learning while preserving the target-reaching property. Given a base policy with this property, the method retains the property and improves on the base policy's performance.
2. **Sample Efficiency**: Trials that fail to satisfy the probabilistic target-reaching condition are avoided, which improves sample efficiency.
3. **Theoretical Analysis**: The paper gives a formal proof that the target-reaching property is preserved.
4. **Experimental Verification**: Experiments were carried out in six environments, including the inverted pendulum, double-tank system, non-holonomic three-wheeled robot, omnidirectional robot (omnibot), and lunar lander. Comparisons were made with multiple baseline agents (such as DDPG, SAC, TD3, PPO, REINFORCE, and VPG). The results show that the proposed agent significantly outperforms the baseline agents in learning dynamics, and its final performance is comparable or better.
### Objectives
- **Ensure Target Reach**: The agent is guaranteed to reach the target state even while it is still learning.
- **Improve Learning Efficiency**: Critic updates are constrained so that fewer trials are wasted and learning is faster.
### Method Overview
- **Base Policy**: First, find a base policy \(\pi_0\) with the target-reaching property.
- **Critic Update**: In each iteration, try to update the critic so that it satisfies certain conditions, such as \(\hat{V}_w(s_{t + 1})-\hat{V}_{w^\dagger}(s_t^\dagger)>0\).
- **Policy Update**: If the critic update is successful, then update the policy; otherwise, use the base policy \(\pi_0\) to generate actions.
- **Constraint Conditions**: Ensure that the output value of the critic is within a certain range, for example, \(-\hat{\kappa}_{\text{up}}(\|s\|)\leq\hat{V}_w(s)\leq-\hat{\kappa}_{\text{low}}(\|s\|)\), where \(\hat{\kappa}_{\text{up}}\) and \(\hat{\kappa}_{\text{low}}\) are two monotonically increasing functions that tend to infinity.
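
The interplay of these steps can be sketched on a toy problem. The Python sketch below is illustrative only and is not the paper's implementation: the scalar dynamics, the quadratic critic \(\hat{V}_w(s) = -w s^2\), the bounds `w_low` and `w_up`, the `propose_weight` hook, and all function names are assumptions made for this example. The acceptance test is evaluated at the current state against the last accepted pair \((w^\dagger, s^\dagger)\), which is one plausible reading of the critic condition above.

```python
def base_policy(s):
    """Assumed target-reaching base policy pi_0: push the state toward s = 0."""
    return -s


def critic(w, s):
    """Toy critic V_w(s) = -w * s^2: nonpositive, maximal at the target s = 0."""
    return -w * s * s


def within_bounds(w, w_low=0.1, w_up=10.0):
    """With kappa_low(r) = w_low * r^2 and kappa_up(r) = w_up * r^2, the bound
    -kappa_up(|s|) <= V_w(s) <= -kappa_low(|s|) reduces to w_low <= w <= w_up."""
    return w_low <= w <= w_up


def run_episode(learned_policy, propose_weight=lambda w, s: w,
                s0=2.0, steps=200, dt=0.1):
    """Roll out s_{t+1} = s_t + dt * a_t with critic-gated fallback to pi_0."""
    s = s0
    w_dag, s_dag = 1.0, s0  # last accepted critic weight and state
    fallbacks = 0
    for _ in range(steps):
        w = propose_weight(w_dag, s)  # candidate critic update (hypothetical hook)
        improved = critic(w, s) - critic(w_dag, s_dag) > 0
        if improved and within_bounds(w):
            w_dag, s_dag = w, s        # accept the critic update
            a = learned_policy(s)      # act with the learned policy
        else:
            a = base_policy(s)         # otherwise fall back to pi_0
            fallbacks += 1
        s = s + dt * a
    return s, fallbacks
```

For instance, `run_episode(lambda s: s, steps=2000)` uses a deliberately bad learned policy that pushes away from the target. In this toy setting, every step it takes away from the target is followed by a rejected critic update and a corrective base-policy step, so the state still shrinks toward zero.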
### Experimental Results
- **Learning Curve**: Across the tested environments, the proposed agent learns markedly faster than the baseline agents, and its final performance is comparable or better.
- **Convergence Time**: For example, in the inverted pendulum environment, the proposed agent reaches approximately optimal performance in far fewer than 20,000 time steps, while the SAC algorithm requires approximately 80,000 time steps.
### Limitations
- **Dependence on Base Policy**: The algorithm requires a base policy \(\pi_0\) with the target-reaching property. Such a policy is usually not difficult to find, but the requirement remains a limitation.
- **Combined Policy**: Combining two policies that each have the target-reaching property does not necessarily yield a policy with that property. The algorithm therefore has to integrate \(\pi_0\) carefully.
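
A small constructed example shows why combining policies needs care. The three-state chain below is hypothetical and does not come from the paper: the states `"1"`, `"2"`, and `"G"`, the two deterministic policies, and the switching rule are all assumptions chosen for illustration. Each policy reaches the target on its own, yet a state-dependent mix of the two cycles forever.

```python
GOAL = "G"

# Each policy is a deterministic map from state to next state (toy assumption).
POLICY_A = {"1": "2", "2": GOAL}  # from 1 go to 2, from 2 go to the target
POLICY_B = {"2": "1", "1": GOAL}  # from 2 go to 1, from 1 go to the target


def rollout(choose_policy, start, max_steps=20):
    """Follow the policy selected in each state; return the visited states."""
    state, path = start, [start]
    for _ in range(max_steps):
        if state == GOAL:
            break
        state = choose_policy(state)[state]
        path.append(state)
    return path


def reaches_goal(choose_policy, start, max_steps=20):
    """True if the rollout ends at the target within max_steps transitions."""
    return rollout(choose_policy, start, max_steps)[-1] == GOAL


def only_a(state):
    return POLICY_A


def only_b(state):
    return POLICY_B


def mixed(state):
    # Naive combination: use policy A in state 1 and policy B in state 2.
    return POLICY_A if state == "1" else POLICY_B
```

Here `reaches_goal(only_a, "1")` and `reaches_goal(only_b, "2")` both hold, while `reaches_goal(mixed, "1")` does not, because the mixed rule alternates between states 1 and 2 without ever taking the transition to the target.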
### Conclusion
The proposed method significantly improves the learning efficiency of reinforcement learning agents while preserving the target-reaching property. In the reported experiments, it learns faster than the baseline agents across multiple environments.