Abstract: Reinforcement learning (RL) is promising for complicated stochastic nonlinear control problems. Without using a mathematical model, an optimal controller can be learned from data evaluated by certain performance criteria through trial-and-error. However, the data-based learning approach is notorious for not guaranteeing stability, which is the most fundamental property for any control system. In this paper, the classic Lyapunov's method is explored to analyze the uniformly ultimate boundedness stability (UUB) solely based on data without using a mathematical model. It is further shown how RL with UUB guarantee can be applied to control dynamic systems with safety constraints. Based on the theoretical results, both off-policy and on-policy learning algorithms are proposed respectively. As a result, optimal controllers can be learned to guarantee UUB of the closed-loop system both at convergence and during learning. The proposed algorithms are evaluated on a series of robotic continuous control tasks with safety constraints. In comparison with the existing RL algorithms, the proposed method can achieve superior performance in terms of maintaining safety. As a qualitative evaluation of stability, our method shows impressive resilience even in the presence of external disturbances.
What problem does this paper attempt to address?
The paper addresses how to guarantee stability in reinforcement learning (RL) control, particularly for dynamic systems with safety constraints. Its goal is a data-based method that learns controllers through RL while guaranteeing uniformly ultimate boundedness (UUB) of the closed-loop system. This targets a known weakness of model-free RL: without a mathematical model, conventional RL methods offer no stability guarantee, which matters most in control tasks that must satisfy safety constraints.
### Main contributions of the paper
1. **Proposed a new data-based UUB theorem**: The theorem does not depend on a mathematical model of the system; it analyzes UUB stability from sampled data. Stability of the control system can therefore be guaranteed even when the system model is unknown.
2. **Extended the definition of UUB**: The paper extends the classical UUB definition to cover safety constraints, making it applicable to a wider range of control tasks.
3. **Designed practical algorithms**: Based on the theoretical results, the paper proposes two algorithms: a policy-optimization algorithm, Lyapunov-based Constrained Policy Optimization (LCPO), and an actor-critic algorithm, Lyapunov-based Soft Actor-Critic (LSAC). Both are designed to guarantee UUB stability of the system during learning.
4. **Experimental verification**: The paper evaluates the proposed algorithms on a series of high-dimensional continuous control tasks, including motion control of legged robots, robotic arms, and quadrotor drones. In these experiments the proposed algorithms maintain safety better than existing safe RL algorithms and are more robust to perturbations.
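
For reference, the classical notion of UUB that the second contribution builds on can be written as below. This is the standard textbook statement for a deterministic system with state $s(t)$; the symbols $b$, $\eta$, $\epsilon$, and $T$ are notation chosen here for illustration, and the paper's own stochastic, safety-constrained extension is not reproduced in this summary.

$$
\exists\, b > 0,\ \eta > 0 \ \text{such that}\ \forall\, \epsilon \in (0, b)\ \ \exists\, T = T(\epsilon, \eta) \ge 0:\qquad
\|s(t_0)\| \le \epsilon \ \Longrightarrow\ \|s(t)\| \le \eta \quad \forall\, t \ge t_0 + T
$$

Here $b$, $\eta$, and $T$ do not depend on the initial time $t_0$, and $\eta$ is called the ultimate bound: trajectories that start close enough to the origin end up inside the ball of radius $\eta$ after a finite time and stay there.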
### Specific methods for solving problems
- **Construction of the Lyapunov function**: The paper uses a Lyapunov function to prove stability of the system. To make the Lyapunov function usable within the reinforcement learning framework, it introduces a Lyapunov critic function (Lc), which is updated by minimizing an objective function.
- **Verification of UUB conditions**: The UUB condition (condition (3) in the paper) is checked through data sampling and the Monte Carlo method, so that the system satisfies UUB stability throughout the learning process.
- **Algorithm design**: Two algorithms, LCPO and LSAC, are designed for the policy-optimization and actor-critic frameworks respectively. Both adjust the controller parameters so that the UUB conditions hold during learning, which in turn ensures the stability and safety of the system.
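
The sampling-based verification step above can be illustrated with a minimal Python sketch. Condition (3) of the paper is not reproduced in this summary, so the sketch substitutes a generic Lyapunov-style decrease condition, $\mathbb{E}[L(s') - (1 - \alpha) L(s)] < 0$ for states outside a small set, as a stand-in. The toy dynamics, the quadratic candidate, and the names `lyapunov_candidate`, `sample_transitions`, `estimate_decrease`, `alpha`, and `bound` are all assumptions made for illustration, not the authors' implementation.

```python
import random


def lyapunov_candidate(state):
    """Hypothetical Lyapunov candidate: squared Euclidean norm of the state."""
    return sum(x * x for x in state)


def sample_transitions(gain, n_samples=2000, noise_std=0.05, seed=0):
    """Collect (s, s') pairs from a toy closed loop s' = (1 - gain) * s + noise.

    The dynamics are an illustrative assumption; in practice the pairs would
    come from interaction data gathered under the current controller.
    """
    rng = random.Random(seed)
    transitions = []
    for _ in range(n_samples):
        s = [rng.uniform(-2.0, 2.0) for _ in range(2)]
        s_next = [(1.0 - gain) * x + rng.gauss(0.0, noise_std) for x in s]
        transitions.append((s, s_next))
    return transitions


def estimate_decrease(transitions, lyapunov, alpha=0.1, bound=0.1):
    """Monte Carlo estimate of E[L(s') - (1 - alpha) * L(s)] over samples with L(s) > bound.

    A negative value suggests that, on average, L shrinks by at least a
    factor (1 - alpha) per step outside the set {s : L(s) <= bound}.
    Returns None when no sample lies outside that set.
    """
    terms = [
        lyapunov(s_next) - (1.0 - alpha) * lyapunov(s)
        for s, s_next in transitions
        if lyapunov(s) > bound
    ]
    if not terms:
        return None
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # gain=0.5 gives a contracting loop; gain=-0.5 gives an expanding one.
    for gain in (0.5, -0.5):
        estimate = estimate_decrease(sample_transitions(gain), lyapunov_candidate)
        print(f"gain={gain}: estimate={estimate:.3f}, condition holds: {estimate < 0}")
```

A negative sample mean is only empirical evidence over the sampled region, not a proof of UUB; turning such an estimate into a guarantee is what the paper's data-based theorem is for.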
### Conclusion
With new theoretical results and practical algorithms, the paper addresses the problem of guaranteeing system stability in reinforcement learning control, especially for dynamic systems with safety constraints. This provides theoretical and technical support for applying reinforcement learning to real engineering control tasks.