NewtonianVAE: Proportional Control and Goal Identification from Pixels via Physical Latent Spaces

Miguel Jaques, Michael Burke, Timothy Hospedales
DOI: https://doi.org/10.48550/arXiv.2006.01959
2021-04-27
Abstract: Learning low-dimensional latent state space dynamics models has been a powerful paradigm for enabling vision-based planning and learning for control. We introduce a latent dynamics learning framework that is uniquely designed to induce proportional controllability in the latent space, thus enabling the use of much simpler controllers than prior work. We show that our learned dynamics model enables proportional control from pixels, dramatically simplifies and accelerates behavioural cloning of vision-based controllers, and provides interpretable goal discovery when applied to imitation learning of switching controllers from demonstration.
Machine Learning, Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses key challenges in vision-based control, in particular how to achieve proportional control directly from pixel-level input, thereby simplifying and accelerating behavioural cloning and goal identification. Specifically, the authors introduce a new framework, NewtonianVAE, which learns a dynamics model in a low-dimensional latent space so that a simple PID controller can be applied directly in that space.

#### Main problems include:

1. **The need for complex planning and reinforcement-learning strategies**:
   - Traditional vision-based control methods usually require complex planning or reinforcement-learning strategies to reach a goal state. This is computationally costly and difficult to achieve on high-dimensional visual data.
   - NewtonianVAE uses a structured latent dynamics model that allows simple proportional control to be applied directly, avoiding the need for complex planning or reinforcement learning.
2. **Challenges in imitation learning from high-dimensional visual data**:
   - High-dimensional visual data (such as images) makes imitation learning very difficult, especially for multi-goal or multi-stage tasks.
   - NewtonianVAE addresses this by recasting imitation learning as a goal inference problem in the latent space, enabling one-shot imitation learning from high-dimensional pixel observations.
3. **Interpretability and explainability**:
   - The latent spaces of existing variational auto-encoder (VAE) models are often difficult to interpret, especially when proportional control is applied to them.
   - NewtonianVAE improves the interpretability of the latent space by introducing physical constraints (such as Newton's second law), allowing for an intuitive understanding of the system's behavior.
4. **Application of path tracking and dynamic movement primitives (DMPs)**:
   - Dynamic movement primitives (DMPs) are powerful tools for trajectory tracking, but face challenges when applied to high-dimensional visual data.
   - NewtonianVAE enables trajectory tracking and path following directly from pixels by learning DMPs in the latent space, thus achieving efficient visual control.

#### Formula Explanation:

- **PID control formula**:
  \[ u_t = K_p (x_{\text{goal}, t} - x_t) + K_i \sum_{t'} (x_{\text{goal}, t'} - x_{t'}) + K_d \frac{x_t - x_{t - 1}}{\Delta t} \]
  where \(K_p\), \(K_i\) and \(K_d\) are gain terms, corresponding to proportional, integral and derivative control respectively.
- **Representation of Newton's second law in the latent space**:
  \[ \frac{d^2 x}{dt^2} = F/m \]
  In NewtonianVAE, the action \(u\) represents the force acting on the system (and hence, up to the factor \(1/m\), the acceleration), and the position \(x\) and velocity \(v\) should follow Newton's second law.

Through these improvements, NewtonianVAE not only simplifies vision-based control tasks but also improves the interpretability and robustness of the system, and is suitable for a variety of complex control scenarios.
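The two formulas above can be combined in a small numerical sketch. It is illustrative only and is not the authors' implementation: the names `pid_action` and `simulate`, the constant goal, and the `mass` and `damping` values are assumptions introduced here, and the scalar position is simulated directly rather than inferred from images by an encoder. The PID term follows the formula as written above, and the demo uses the proportional term alone.

```python
def pid_action(x_goal, xs, k_p, k_i=0.0, k_d=0.0, dt=0.1):
    """PID action for a constant goal, given the position history xs (latest last).

    Mirrors the summary's formula: the integral term is a plain sum of past
    errors, and the derivative term uses (x_t - x_{t-1}) / dt.
    """
    error = x_goal - xs[-1]
    integral = sum(x_goal - x for x in xs)
    rate = (xs[-1] - xs[-2]) / dt if len(xs) > 1 else 0.0
    return k_p * error + k_i * integral + k_d * rate


def simulate(x0, x_goal, steps=400, dt=0.05, mass=1.0, damping=2.0, k_p=4.0):
    """Drive a damped point mass to x_goal with proportional control only.

    The dynamics are Newton's second law, d2x/dt2 = F / m, where the force is
    the action u minus a viscous damping term (the damping is an assumption of
    this sketch, added so that pure proportional control settles).
    """
    x, v = x0, 0.0
    xs = [x]
    for _ in range(steps):
        u = pid_action(x_goal, xs, k_p=k_p, dt=dt)  # P-only: k_i = k_d = 0
        a = (u - damping * v) / mass                # acceleration = F / m
        v = v + dt * a                              # integrate velocity
        x = x + dt * v                              # integrate position
        xs.append(x)
    return xs


if __name__ == "__main__":
    trajectory = simulate(x0=0.0, x_goal=1.0)
    print(f"final position: {trajectory[-1]:.6f}")
```

Because the state evolves as a second-order system driven by the action, a single gain on the position error is enough to reach the goal in this toy setting, which is the property the summary attributes to the learned latent space.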