Abstract: The "AI Olympics with RealAIGym" competition challenges participants to stabilize chaotic underactuated dynamical systems with advanced control algorithms. In this paper, we present a novel solution submitted to the IROS'24 competition, which builds upon Soft Actor-Critic (SAC), a popular model-free entropy-regularized Reinforcement Learning (RL) algorithm. We add a 'context' vector to the state, which encodes the immediate history via a Convolutional Neural Network (CNN) to counteract the unmodeled effects on the real system. Our method achieves high performance scores and competitive robustness scores on both tracks of the competition: Pendubot and Acrobot.
What problem does this paper attempt to address?
This paper addresses how to stabilize chaotic underactuated dynamical systems with advanced control algorithms, in the setting of the IROS'24 competition "AI Olympics with RealAIGym". Specifically, contestants must design a controller that swings the robot up from the hanging position and stabilizes it in the upright position, while keeping the system stable and robust under random disturbances.
### Problem Background
1. **Competition Objectives**:
- The competition aims to evaluate the motion intelligence of robots through standardized benchmark tasks, especially for underactuated double-inverted-pendulum systems (such as Pendubot and Acrobot).
- Evaluation comprises a performance score from a single simulation run and a robustness score computed over multiple simulation runs. The latter accounts for the effects of physical parameter changes, noise, and other perturbations.
2. **Challenges**:
- The underactuated and chaotic nature of these systems makes control very difficult.
- Disturbances (such as strong pushes) occur at random times during both the swing-up and stabilization phases, increasing the control difficulty.
- Sim-to-Real gap: Policies trained in the simulation environment may perform poorly on real systems.
### Solutions
To address these challenges, the authors propose an improved method based on the Soft Actor-Critic (SAC) algorithm, called Velocity-History-Based Soft Actor-Critic. The main innovations include:
1. **History Encoding**:
- A "context" vector is introduced into the state representation. This vector encodes past velocity measurements through a convolutional neural network (CNN) to capture the historical information of the system.
- This helps the policy adapt to changing dynamics when the system is non-Markovian or only partially observable.
2. **Reward Design**:
- A dense reward function is designed to provide a richer feedback signal, which speeds up learning and improves the performance of the final policy.
- The specific reward function \( R_2(s, a) \) contains two main terms: a squared angular distance term and a regularization term \( E(s, a) \), which penalizes large angular velocities and torque changes.
3. **System Identification**:
- The Sim-to-Real gap is narrowed by optimizing the physical parameters so that the behavior of the simulation environment matches that of the real system more closely.
- The differential evolution algorithm is used to minimize the difference between the simulated trajectory and the real trajectory.
4. **Multi-environment Training**:
- The robustness of the model is improved by training in multiple environments, each with different perturbations.
- Certain types of perturbations (such as torque perturbations and action noise) are excluded to avoid overcomplicating the training process.
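
The history encoding described above can be illustrated with a minimal sketch. It is not the authors' network: the history length, kernel size, number of filters, `tanh` activation, and mean pooling are all assumptions chosen for brevity, and the filter weights are random here, whereas in the actual method they would be learned jointly with the SAC actor and critic. The sketch only shows the data flow: a buffer of past joint velocities is passed through a 1D convolution over time, and the resulting context vector is appended to the state.

```python
import math
import random
from collections import deque

HISTORY_LEN = 8   # assumed number of past velocity samples kept
KERNEL_SIZE = 3   # assumed width of the 1D convolution over time
N_FILTERS = 4     # assumed size of the context vector
N_CHANNELS = 2    # two joint velocities of the double pendulum


def make_filters(seed=0):
    """Random stand-in weights, indexed as filters[filter][tap][channel]."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(N_CHANNELS)]
             for _ in range(KERNEL_SIZE)]
            for _ in range(N_FILTERS)]


def encode_history(history, filters):
    """1D convolution over time, tanh activation, then mean pooling."""
    steps = len(history) - KERNEL_SIZE + 1
    context = []
    for filt in filters:
        activations = []
        for t in range(steps):
            s = sum(filt[k][c] * history[t + k][c]
                    for k in range(KERNEL_SIZE)
                    for c in range(N_CHANNELS))
            activations.append(math.tanh(s))
        context.append(sum(activations) / steps)
    return context


def augment_state(state, history, filters):
    """Append the context vector to the raw state."""
    return list(state) + encode_history(list(history), filters)


# Rolling buffer of the most recent velocity measurements, zero-initialized.
history = deque([(0.0, 0.0)] * HISTORY_LEN, maxlen=HISTORY_LEN)
filters = make_filters()

state = (0.1, -0.2, 0.5, -0.3)   # (theta1, theta2, omega1, omega2)
history.append(state[2:])        # push the newest velocities
augmented = augment_state(state, history, filters)
```

The augmented state, rather than the raw state, is what the policy and value networks would receive at each step under this sketch.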
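
The structure of the dense reward can likewise be sketched. The summary only states that \( R_2(s, a) \) combines a squared angular distance term with a regularizer \( E(s, a) \) on angular velocity and torque changes, so everything else below is an assumption: the negative sign convention, the weights `w_vel` and `w_dtau`, the upright goal `(pi, 0)` with a relative second-joint angle, and the single scalar torque.

```python
import math


def wrap_angle(x):
    """Wrap an angle difference into [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi


def dense_reward(theta, omega, torque, prev_torque,
                 goal=(math.pi, 0.0), w_vel=0.01, w_dtau=0.1):
    """Hypothetical dense reward: negative distance to goal minus a regularizer.

    theta, omega: joint angles and angular velocities (two joints).
    torque, prev_torque: current and previous motor command (one actuator).
    w_vel, w_dtau: illustrative weights, not taken from the paper.
    """
    # Squared angular distance to the upright configuration.
    dist_sq = sum(wrap_angle(t - g) ** 2 for t, g in zip(theta, goal))
    # Regularizer: penalize large velocities and abrupt torque changes.
    reg = w_vel * sum(w ** 2 for w in omega) + w_dtau * (torque - prev_torque) ** 2
    return -(dist_sq + reg)


r_goal = dense_reward((math.pi, 0.0), (0.0, 0.0), 0.0, 0.0)
r_hanging = dense_reward((0.0, 0.0), (0.0, 0.0), 0.0, 0.0)
```

Because the reward is nonzero everywhere away from the goal, the agent receives a learning signal at every step instead of only upon reaching the upright position.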
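
The system identification step can be sketched in the same spirit. The example below is a toy: a damped single pendulum stands in for the double pendulum simulator, the "real" trajectory is generated from known parameters, and a small hand-written differential evolution loop (rand/1/bin with assumed settings `f`, `cr`, population size, and generation count) searches for parameters that minimize the mean squared difference between simulated and reference trajectories. None of these choices are taken from the paper.

```python
import math
import random


def simulate(params, n_steps=200, dt=0.01):
    """Toy damped pendulum; params = (length, damping). Returns the angle trajectory."""
    length, damping = params
    theta, omega = 0.5, 0.0
    trajectory = []
    for _ in range(n_steps):
        alpha = -(9.81 / length) * math.sin(theta) - damping * omega
        omega += alpha * dt
        theta += omega * dt
        trajectory.append(theta)
    return trajectory


def trajectory_loss(params, reference):
    """Mean squared difference between simulated and reference angles."""
    sim = simulate(params)
    return sum((a - b) ** 2 for a, b in zip(sim, reference)) / len(reference)


def differential_evolution(loss, bounds, pop_size=12, generations=30,
                           f=0.7, cr=0.9, seed=0):
    """Minimal rand/1/bin differential evolution with greedy selection."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    costs = [loss(p) for p in pop]
    initial_best = min(costs)
    for _ in range(generations):
        for i in range(pop_size):
            others = [p for j, p in enumerate(pop) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(dim)  # at least one dimension always mutates
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    value = a[d] + f * (b[d] - c[d])
                else:
                    value = pop[i][d]
                trial.append(min(max(value, lo), hi))  # clip to bounds
            cost = loss(trial)
            if cost <= costs[i]:  # keep the trial only if it is no worse
                pop[i], costs[i] = trial, cost
    best = min(range(pop_size), key=costs.__getitem__)
    return pop[best], costs[best], initial_best


TRUE_PARAMS = (0.3, 0.2)                 # stands in for the unknown real system
reference = simulate(TRUE_PARAMS)        # stands in for a recorded trajectory
BOUNDS = [(0.1, 1.0), (0.0, 1.0)]        # search ranges for length and damping

best_params, best_cost, initial_cost = differential_evolution(
    lambda p: trajectory_loss(p, reference), BOUNDS)
```

In practice a library routine such as SciPy's `differential_evolution` would likely replace the hand-written loop; it is spelled out here only to keep the example dependency-free.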
### Results
The experimental results show that the proposed method improves both performance and robustness on the Pendubot system. Compared with baseline controllers (such as iLQR and TVLQR), it performs well in terms of swing-up time and energy consumption. It also outperforms existing methods on multiple evaluation metrics, notably reaching a robustness score of 0.905.
In conclusion, this paper tackles the stable control of underactuated dynamical systems by encoding historical information and refining the reward design, and achieves strong results in the competition.