Learning Quadrotor Control From Visual Features Using Differentiable Simulation

Johannes Heeg, Yunlong Song, Davide Scaramuzza
2024-10-21
Abstract: The sample inefficiency of reinforcement learning (RL) remains a significant challenge in robotics. RL requires large-scale simulation and, still, can cause long training times, slowing down research and innovation. This issue is particularly pronounced in vision-based control tasks where reliable state estimates are not accessible. Differentiable simulation offers an alternative by enabling gradient back-propagation through the dynamics model, providing low-variance analytical policy gradients and, hence, higher sample efficiency. However, its usage for real-world robotic tasks has yet been limited. This work demonstrates the great potential of differentiable simulation for learning quadrotor control. We show that training in differentiable simulation significantly outperforms model-free RL in terms of both sample efficiency and training time, allowing a policy to learn to recover a quadrotor in seconds when providing vehicle state and in minutes when relying solely on visual features. The key to our success is two-fold. First, the use of a simple surrogate model for gradient computation greatly accelerates training without sacrificing control performance. Second, combining state representation learning with policy learning enhances convergence speed in tasks where only visual features are observable. These findings highlight the potential of differentiable simulation for real-world robotics and offer a compelling alternative to conventional RL approaches.
Robotics
### What problem does this paper attempt to solve?

This paper addresses the low sample efficiency of reinforcement learning (RL) in robotic control, especially in vision-based control tasks. Specifically, it examines the following issues:

1. **Sample efficiency problem**:
   - RL requires large amounts of simulation data and long training times, which slows down research and innovation.
   - In vision-based control tasks the problem is particularly pronounced because reliable state estimates are not available.
2. **Limitations of traditional control methods**:
   - Traditional cascaded controllers rely on accurate state estimates. When the estimates are inaccurate or noisy, these methods become unreliable.
   - Learning-based methods such as model-free RL perform well in some applications, but they face significant challenges in sample complexity and training stability, especially in high-dimensional control tasks with visual observations.
3. **Combining state representation learning and policy learning**:
   - In many real-world robotic scenarios, state information is not directly accessible, so a neural network must learn a state representation and a control policy simultaneously from feature observations, which increases sample complexity.

### The method proposed in the paper

To address these problems, the authors use differentiable simulation, an emerging learning approach that back-propagates gradients through the dynamics model, yielding low-variance analytical policy gradients and higher sample efficiency. The specific contributions are:

1. **Using a simplified surrogate model to accelerate gradient computation**:
   - Back-propagating through a simplified dynamics model significantly speeds up training without sacrificing control performance.
2. **Combining state representation learning and policy learning**:
   - In vision-based tasks, pre-training the network parameters to learn a state representation accelerates convergence and improves final performance.
3. **Verification in practical applications**:
   - The paper shows that policies trained in differentiable simulation can be deployed successfully in the real world, for example to stabilize a quadrotor after a manual throw while relying only on visual features, without state estimation.

### Experimental results

- **State-based control tasks**: Back-propagation through time (BPTT) in differentiable simulation trains more than 8 times faster than PPO (Proximal Policy Optimization) and requires more than 80% fewer samples.
- **Visual-feature-based control tasks**: Even in the more complex vision-based tasks, BPTT significantly outperforms PPO, converging faster and reaching higher final performance.

In conclusion, this paper demonstrates the great potential of differentiable simulation for quadrotor control, especially for improving sample efficiency and shortening training time.
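
To make the idea of analytical policy gradients concrete, the sketch below back-propagates a trajectory cost through a toy dynamics model by hand. Everything in it is an illustrative assumption rather than the paper's setup: the system is a 1D point mass instead of a quadrotor, the policy is a two-parameter linear feedback law instead of a neural network, and the names `loss_and_grad` and `train` are hypothetical. It only shows how a gradient with respect to policy parameters can be obtained exactly from one simulated rollout, without sampling-based estimation.

```python
# Toy differentiable simulation: 1D point mass with a linear feedback policy.
# Assumption: this is NOT the paper's quadrotor model, only an illustration
# of back-propagation through time (BPTT) over a dynamics model.

def loss_and_grad(theta, z0=1.0, v0=0.0, dt=0.05, steps=40):
    """Roll out the policy, then back-propagate the cost through the dynamics.

    theta = (k1, k2) are the gains of the policy u = -(k1 * z + k2 * v).
    Returns the trajectory cost and its exact gradient with respect to theta.
    """
    k1, k2 = theta
    z, v = z0, v0
    tape = []  # states stored during the forward pass for the backward pass
    loss = 0.0
    for _ in range(steps):
        u = -(k1 * z + k2 * v)      # policy
        z_next = z + dt * v         # explicit Euler position update
        v_next = v + dt * u         # explicit Euler velocity update
        tape.append((z, v, z_next, v_next))
        loss += z_next * z_next + 0.1 * v_next * v_next
        z, v = z_next, v_next

    g_k1 = g_k2 = 0.0
    a_z = a_v = 0.0  # adjoints flowing back from later time steps
    for z, v, z_next, v_next in reversed(tape):
        a_zn = a_z + 2.0 * z_next   # add the direct cost term on z_next
        a_vn = a_v + 0.2 * v_next   # add the direct cost term on v_next
        a_u = a_vn * dt             # v_next depends on u with slope dt
        g_k1 += a_u * (-z)          # du/dk1 = -z
        g_k2 += a_u * (-v)          # du/dk2 = -v
        a_z = a_zn - a_u * k1       # chain rule through z and u
        a_v = a_zn * dt + a_vn - a_u * k2
    return loss, (g_k1, g_k2)


def train(theta=(0.0, 0.0), lr=0.01, iters=200):
    """Gradient descent on the gains, halving the step until the cost drops."""
    loss, grad = loss_and_grad(theta)
    history = [loss]
    for _ in range(iters):
        step = lr
        while step > 1e-8:
            cand = (theta[0] - step * grad[0], theta[1] - step * grad[1])
            cand_loss, cand_grad = loss_and_grad(cand)
            if cand_loss < loss:
                theta, loss, grad = cand, cand_loss, cand_grad
                break
            step *= 0.5
        history.append(loss)
    return theta, history


if __name__ == "__main__":
    theta, history = train()
    print("initial cost:", history[0])
    print("final cost:  ", history[-1])
    print("learned gains:", theta)
```

Each call to `loss_and_grad` uses a single deterministic rollout to produce an exact gradient, which is the property the summary refers to as low-variance analytical policy gradients. In practice an automatic differentiation framework would replace the hand-written backward pass.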