Abstract: Learning low-dimensional latent state space dynamics models has been a powerful paradigm for enabling vision-based planning and learning for control. We introduce a latent dynamics learning framework that is uniquely designed to induce proportional controllability in the latent space, thus enabling the use of much simpler controllers than prior work. We show that our learned dynamics model enables proportional control from pixels, dramatically simplifies and accelerates behavioural cloning of vision-based controllers, and provides interpretable goal discovery when applied to imitation learning of switching controllers from demonstration.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses several key challenges in vision-based control, in particular how to achieve proportional control directly from pixel-level input, thereby simplifying and accelerating behavioral cloning and goal identification. Specifically, the authors introduce a new framework, NewtonianVAE. By learning a dynamics model in a low-dimensional latent space, it becomes possible to apply a simple PID controller directly.

#### Main problems include:

1. **The need for complex planning and reinforcement-learning strategies**:
   - Traditional vision-based control methods usually require complex planning or reinforcement-learning strategies to reach a goal state. This is computationally costly and difficult to achieve on high-dimensional visual data.
   - NewtonianVAE, through its structured latent dynamics model, allows simple proportional control to be applied directly, avoiding the need for complex planning or reinforcement learning.
2. **Challenges in imitation learning from high-dimensional visual data**:
   - High-dimensional visual data (such as images) makes imitation learning very difficult, especially for multi-goal or multi-stage tasks.
   - NewtonianVAE addresses this by recasting imitation learning as a goal inference problem in the latent space, enabling one-shot imitation learning from high-dimensional pixel observations.
3. **Interpretability and explainability**:
   - The latent spaces of existing variational autoencoder (VAE) models are often difficult to interpret, especially when applying proportional control.
   - NewtonianVAE improves the interpretability of the latent space by introducing physical constraints (such as Newton's second law), allowing for an intuitive understanding of the system's behavior.
4. **Application of path tracking and dynamic movement primitives (DMPs)**:
   - Dynamic movement primitives (DMPs) are powerful tools for trajectory tracking, but face challenges when applied to high-dimensional visual data.
   - NewtonianVAE enables trajectory tracking and path following directly from pixels by learning DMPs in the latent space, thus achieving efficient visual control.

#### Formula Explanation:

- **PID control formula**:
  \[ u_t = K_p (x_{\text{goal}, t} - x_t) + K_i \sum_{t'} (x_{\text{goal}, t'} - x_{t'}) + K_d \frac{x_t - x_{t - 1}}{\Delta t} \]
  where \(K_p\), \(K_i\) and \(K_d\) are gain terms, corresponding to proportional, integral and derivative control respectively.
- **Representation of Newton's second law in the latent space**:
  \[ \frac{d^2 x}{dt^2} = F/m \]
  In NewtonianVAE, the action \(u\) plays the role of the force \(F\) acting on the system (the acceleration, up to the mass \(m\)), and the latent position \(x\) and velocity \(v\) should follow Newton's second law.

Through these improvements, NewtonianVAE not only simplifies vision-based control tasks but also improves the interpretability and robustness of the system, and is suitable for a variety of complex control scenarios.
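To make the two formulas above concrete, here is a minimal sketch (not the authors' implementation) of a discrete PID controller driving a one-dimensional point mass that obeys Newton's second law. The scalar `x` stands in for a latent position; the function name `simulate_pid`, the gains, the mass, and the step size are all hypothetical choices for illustration. One assumption to flag: the derivative term is taken on the error, which is the common textbook convention; for a fixed goal it differs from the position-difference form written above only by a sign that can be absorbed into \(K_d\).

```python
def simulate_pid(x0, x_goal, kp, ki=0.0, kd=0.0, mass=1.0, dt=0.05, steps=400):
    """Drive a 1-D point mass from x0 towards x_goal with a discrete PID law.

    Illustrative only: in a NewtonianVAE-style setup, x would be the latent
    position inferred from an image rather than a directly observed state.
    """
    x, v = x0, 0.0
    error_sum = 0.0               # running sum of errors (integral term)
    prev_error = x_goal - x0      # makes the first derivative estimate zero
    trajectory = [x]
    for _ in range(steps):
        error = x_goal - x
        error_sum += error
        # Assumption: derivative on the error (textbook form), not on x.
        d_error = (error - prev_error) / dt
        u = kp * error + ki * error_sum + kd * d_error  # control = force
        prev_error = error
        # Newton's second law, a = F / m, integrated with semi-implicit Euler.
        v += (u / mass) * dt
        x += v * dt
        trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    traj = simulate_pid(x0=0.0, x_goal=1.0, kp=4.0, kd=3.0)
    print(f"final position: {traj[-1]:.4f}")
```

The sketch sets `kd > 0` because a frictionless mass driven by a purely proportional force would oscillate around the goal rather than settle; with damping in the environment or in the controller, the proportional term alone determines where the system comes to rest.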

NewtonianVAE: Proportional Control and Goal Identification from Pixels via Physical Latent Spaces
