AI Olympics challenge with Evolutionary Soft Actor Critic

Marco Calì, Alberto Sinigaglia, Niccolò Turcato, Ruggero Carli, Gian Antonio Susto
2024-10-28
Abstract: In this report, we describe the solution we propose for the AI Olympics competition held at IROS 2024. Our solution is based on a model-free deep reinforcement learning approach combined with an evolutionary strategy. We briefly describe the algorithms used and then provide details of the approach.
Robotics, Artificial Intelligence, Machine Learning, Neural and Evolutionary Computing
What problem does this paper attempt to address?
The paper addresses how to design an effective controller for an underactuated double-pendulum system (in both the acrobot and pendubot settings) for the AI Olympics competition, so that the pendulum swings up and is then stabilized. The competition has two phases:

1. **Simulation phase**:
   - Design a controller that makes the robot swing up and stabilize at the vertical position.
   - The controller must complete the task within 10 seconds, with the system simulated at 500 Hz.
   - The robustness of the controller is evaluated.
2. **Real hardware phase**:
   - Test the performance of the controller on the physical system.
   - Handle the discrepancies between the simulation environment and the real system, such as differences in mass, length, and friction effects.

To achieve these goals, the authors propose a method based on model-free deep reinforcement learning combined with an evolutionary strategy. The specific steps are as follows:

- **Initial training**: Use the Soft Actor-Critic (SAC) algorithm to train the agent to perform the main tasks (swing-up and stabilization). SAC promotes exploration and improves the robustness of the policy by adding an entropy term to the objective.
- **Fine-tuning**: Further optimize the agent with an evolutionary algorithm (such as Separable Natural Evolution Strategy, SNES) so that it better fits the actual scoring function of the competition.
- **Reward function design**: Since the competition's scoring function is complex and difficult to optimize directly, the authors design a surrogate reward function that is easier to optimize during training.

With this method, the authors aim to achieve strong results in simulation and to obtain a controller that also performs well and robustly on the real hardware.

### Key formulas

1. **Optimization objective of SAC**:
   \[ J(\pi)=\mathbb{E}_{s_t, a_t\sim\pi}\left[\sum_{t}\gamma^t\left(r(s_t, a_t)+\alpha H(\pi(\cdot|s_t))\right)\right] \]
   where \(H\) is the entropy of the policy, \(\gamma\) is the discount factor, and \(\alpha\) is a temperature parameter that controls the importance of the entropy term.
2. **Surrogate reward function**:
   \[ R(s, a)=\begin{cases} V+\alpha[1 + \cos(\theta_2)]^2-\beta T&\text{if }y > y_{th}\\ -\rho_1a^2-\phi_1\Delta a+V-\rho_2a^2-\phi_2\Delta a-\eta\|\dot{s}\|^2&\text{otherwise} \end{cases} \]
   where:
   - \(V\) is the potential energy of the system,
   - \(T\) is the kinetic energy of the system,
   - \(a\) is the normalized action,
   - \(\Delta a\) is the difference between the current action and the previous action,
   - \(\|\dot{s}\|^2 = \dot{\theta}_1^2+\dot{\theta}_2^2\) is the squared norm of the angular velocity of the robot.

Overall, the authors aim for a solution that can be trained efficiently and that also copes with the practical challenges of the real system.
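
To make the SAC objective concrete, the following minimal sketch evaluates the bracketed sum \(\sum_t \gamma^t (r_t + \alpha H_t)\) for a single sampled trajectory. It is an illustration only, not the authors' implementation: the function names, the default values of `gamma` and `alpha`, and the choice of a one-dimensional Gaussian policy for the entropy term are all assumptions made here.

```python
import math


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian policy with standard deviation `std`.

    Assumption: a Gaussian action distribution is used only as an example of
    a policy whose entropy has a closed form.
    """
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def soft_return(rewards, entropies, gamma=0.99, alpha=0.2):
    """Single-trajectory estimate of sum_t gamma^t * (r_t + alpha * H_t).

    `gamma` and `alpha` defaults are placeholder values, not taken from the paper.
    """
    if len(rewards) != len(entropies):
        raise ValueError("rewards and entropies must have the same length")
    total = 0.0
    discount = 1.0  # gamma^t, starting at t = 0
    for r, h in zip(rewards, entropies):
        total += discount * (r + alpha * h)
        discount *= gamma
    return total
```

Averaging `soft_return` over many trajectories sampled from the policy would approximate \(J(\pi)\). A larger `alpha` gives more weight to high-entropy (more exploratory) policies.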
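
The surrogate reward can be written as a small function that follows the piecewise formula term by term as it is stated in this summary. This is a sketch under stated assumptions: the weight values in `RewardWeights` are hypothetical placeholders (the summary gives none), the energies \(V\) and \(T\) are assumed to be computed elsewhere and passed in, and \(y\) is treated as a scalar that is simply compared with the threshold \(y_{th}\).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical placeholder weights; the actual values are not given in the text."""
    alpha: float = 1.0
    beta: float = 1.0
    rho1: float = 1.0
    phi1: float = 1.0
    rho2: float = 1.0
    phi2: float = 1.0
    eta: float = 1.0
    y_th: float = 0.0


def surrogate_reward(V, T, theta2, y, a, a_prev, dtheta1, dtheta2,
                     w=RewardWeights()):
    """Piecewise surrogate reward, transcribed literally from the formula above."""
    if y > w.y_th:
        # First case: potential energy, second-joint alignment bonus, kinetic penalty.
        return V + w.alpha * (1.0 + math.cos(theta2)) ** 2 - w.beta * T
    delta_a = a - a_prev  # signed difference, taken literally from the definition
    speed_sq = dtheta1 ** 2 + dtheta2 ** 2  # squared norm of the angular velocity
    return (-w.rho1 * a ** 2 - w.phi1 * delta_a
            + V
            - w.rho2 * a ** 2 - w.phi2 * delta_a
            - w.eta * speed_sq)
```

For example, with the placeholder weights, `surrogate_reward(V=2.0, T=0.5, theta2=0.0, y=1.0, a=0.0, a_prev=0.0, dtheta1=0.0, dtheta2=0.0)` falls in the first case and evaluates to `2 + 4 - 0.5 = 5.5`.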
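
The summary names SNES for the fine-tuning step but does not give its update rules. The sketch below follows the commonly published SNES formulation (a diagonal Gaussian search distribution, rank-based utilities, and multiplicative updates of the per-dimension step sizes); it is not the authors' code. The function name, the population size, the learning rates, and the toy fitness used in the usage note are assumptions. In the fine-tuning setting, `fitness` would presumably score a policy whose parameters are the candidate vector.

```python
import math
import random


def snes_maximize(fitness, mu, sigma, generations=200, popsize=20, seed=0):
    """Maximize `fitness` with a basic Separable Natural Evolution Strategy."""
    rng = random.Random(seed)
    dim = len(mu)
    mu, sigma = list(mu), list(sigma)
    eta_mu = 1.0
    eta_sigma = (3.0 + math.log(dim)) / (5.0 * math.sqrt(dim))
    # Rank-based utilities (best candidate first); they sum to zero.
    raw = [max(0.0, math.log(popsize / 2.0 + 1.0) - math.log(k))
           for k in range(1, popsize + 1)]
    total = sum(raw)
    utils = [r / total - 1.0 / popsize for r in raw]
    for _ in range(generations):
        # Sample standard-normal perturbations and evaluate the candidates.
        noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(popsize)]
        scores = [fitness([m + s * e for m, s, e in zip(mu, sigma, eps)])
                  for eps in noise]
        order = sorted(range(popsize), key=lambda i: scores[i], reverse=True)
        grad_mu = [0.0] * dim
        grad_sigma = [0.0] * dim
        for u, i in zip(utils, order):
            for j in range(dim):
                grad_mu[j] += u * noise[i][j]
                grad_sigma[j] += u * (noise[i][j] ** 2 - 1.0)
        for j in range(dim):
            mu[j] += eta_mu * sigma[j] * grad_mu[j]
            sigma[j] *= math.exp(0.5 * eta_sigma * grad_sigma[j])
    return mu, sigma
```

As a usage example, `snes_maximize(lambda x: -sum((xi - 1.0) ** 2 for xi in x), [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])` moves the mean toward `[1, 1, 1]` on this toy quadratic. Because only fitness rankings are used, the method needs no gradients of the score, which is one reason such strategies suit objectives that are hard to optimize directly.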