Abstract: Learning interpretable representations of visual data is an important challenge, to make machines' decisions understandable to humans and to improve generalisation outside of the training distribution. To this end, we propose a deep learning framework where one can specify nonlinear priors for videos (e.g. of Newtonian physics) that allow the model to learn interpretable latent variables and use these to generate videos of hypothetical scenarios not observed at training time. We do this by extending the Variational Auto-Encoder (VAE) prior from a simple isotropic Gaussian to an arbitrary nonlinear temporal Additive Noise Model (ANM), which can describe a large number of processes (e.g. Newtonian physics). We propose a novel linearization method that constructs a Gaussian Mixture Model (GMM) approximating the prior, and derive a numerically stable Monte Carlo estimate of the KL divergence between the posterior and prior GMMs. We validate the method on different real-world physics videos including a pendulum, a mass on a spring, a falling object and a pulsar (rotating neutron star). We specify a physical prior for each experiment and show that the correct variables are learned. Once a model is trained, we intervene on it to change different physical variables (such as oscillation amplitude or adding air drag) to generate physically correct videos of hypothetical scenarios that were not observed previously.

What problem does this paper attempt to address?

The paper addresses how to learn interpretable representations from videos, so that a machine's decisions are more understandable to humans and the model generalizes better outside the training distribution. Specifically, it proposes a deep learning framework in which the user specifies a nonlinear prior (for example, Newtonian physics). The model then learns interpretable latent variables and uses them to generate videos of hypothetical scenarios that were not seen during training.

### Problem Background

Earlier work in causal reasoning and intuitive physics mainly focuses either on identifying causal relationships between variables provided by domain experts, or on designing systems that imitate the human "common sense" understanding of physical phenomena. However, these methods often cannot perform counterfactual reasoning about unobserved scenarios. This paper aims to combine the strengths of both lines of work by introducing priors over physical mechanisms, so that the model can learn physical variables and generate counterfactual videos.

### Core Problems of the Paper

1. **How to learn interpretable representations from videos**: To make the machine's decisions more transparent, the model must learn latent variables that have a clear physical meaning.
2. **How to generate videos of hypothetical scenarios**: Once the physical variables are learned, the model should generate videos of previously unobserved scenarios by intervening on those variables, for example by changing the gravitational acceleration or increasing air resistance.
3. **How to handle nonlinear priors**: Conventional variational auto-encoders (VAEs) usually use a simple isotropic Gaussian prior. The proposed method extends this prior so that it can describe complex nonlinear processes (such as Newtonian physics).
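The interventions named in the second core problem can be pictured on the latent physical variable alone. The sketch below is not the paper's model: it only integrates the height of a falling object and then changes the mechanism, leaving out the encoder and decoder that would map to and from video frames. The function name `simulate_fall`, the linear drag law, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def simulate_fall(g=9.81, drag=0.0, y0=10.0, v0=0.0, dt=1e-3, steps=1000):
    """Integrate dv/dt = -g - drag * v and dy/dt = v with semi-implicit Euler.

    Returns the height y at every step (array of length steps + 1).
    The linear drag term is an illustrative assumption.
    """
    y = np.empty(steps + 1)
    y[0] = y0
    v = v0
    for i in range(steps):
        v += dt * (-g - drag * v)  # update the velocity first
        y[i + 1] = y[i] + dt * v   # then move the object with the new velocity
    return y


observed = simulate_fall()             # regime seen in training: no drag
with_drag = simulate_fall(drag=2.0)    # intervention: add linear air drag
low_gravity = simulate_fall(g=1.62)    # intervention: weaker gravity
```

In the framework described above, trajectories such as `with_drag` would be passed through the trained decoder to render the counterfactual video; here they are plain arrays.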
### Solutions

To solve the above problems, the paper makes the following contributions:

1. **A new interpretable representation learning framework based on the VAE**: The framework uses a nonlinear Additive Noise Model (ANM) as the prior, allowing the model to learn latent variables from time-series data.
2. **A new method for approximating the prior density**: The prior is linearized locally, which automatically decomposes it into a non-isotropic Gaussian Mixture Model (GMM) and makes complex nonlinear processes tractable.
3. **A numerically stable and highly parallel KL-divergence estimate**: A Monte Carlo estimate of the KL divergence between the posterior and prior GMMs is derived and used to optimize the VAE objective.
4. **Experimental verification**: Experiments on four real-world physics videos show that the method learns the correct latent variables and generates realistic counterfactual videos.

### Conclusions

By introducing priors over physical mechanisms, the paper develops a method that learns physical variables from videos and generates videos of hypothetical scenarios. The method improves the interpretability of the model and produces realistic counterfactual videos outside the training distribution. Future work is expected to broaden the range of applications and relax the assumption that the prior must be fully specified.

### Formula Summary

- **Object position as a function of time**: \[ y(t)=A\cos(\omega t)+n \] where \(A\) is the amplitude, \(\omega\) is the angular frequency, and \(n\) is Gaussian noise.
- **Additive Noise Model (ANM)**: \[ y = f(x)+n \] where \(f(x)\) is a known physical mechanism and \(n\) is Gaussian noise.
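- **Mean and covariance matrix of a Gaussian Mixture Model (GMM) component**, obtained by linearizing \(f\) around \(t_0\): \[ \begin{pmatrix} t \\ y \end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix} t_0 \\ f(t_0)+n_0 \end{pmatrix},\begin{bmatrix} \sigma_t^2 & \frac{\partial f}{\partial t}(t_0)\sigma_t^2 \\ \frac{\partial f}{\partial t}(t_0)\sigma_t^2 & \left(\frac{\partial f}{\partial t}(t_0)\right)^2\sigma_t^2+\sigma_n^2 \end{bmatrix}\right) \] where \(\sigma_t^2\) is the variance of \(t\) around \(t_0\) and \(\sigma_n^2\) is the variance of the noise \(n\).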
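The linearization and the KL estimate in the summary can be sketched together in NumPy. This is a minimal illustration under stated assumptions, not the paper's estimator. It builds one Gaussian per linearization point, weights the components uniformly, and uses a plain Monte Carlo average of log-density differences with a log-sum-exp for stability. The noise is assumed to be zero-mean with standard deviation `sigma_n`, so the variance of \(y\) is \(f'(t_0)^2\sigma_t^2+\sigma_n^2\), as a first-order expansion of \(y=f(t)+n\) gives. All function names, the grid, and the parameter values are hypothetical.

```python
import numpy as np


def linearized_component(f, df, t0, sigma_t, sigma_n):
    """Gaussian over (t, y) from y ~= f(t0) + f'(t0) * (t - t0) + n."""
    slope = df(t0)
    mean = np.array([t0, f(t0)])  # assumes zero-mean noise
    cov = np.array([
        [sigma_t**2, slope * sigma_t**2],
        [slope * sigma_t**2, slope**2 * sigma_t**2 + sigma_n**2],
    ])
    return mean, cov


def gmm_from_anm(f, df, t_grid, sigma_t, sigma_n):
    """One linearized component per grid point, uniform weights (assumed)."""
    parts = [linearized_component(f, df, t0, sigma_t, sigma_n) for t0 in t_grid]
    means = np.stack([m for m, _ in parts])
    covs = np.stack([c for _, c in parts])
    log_w = np.full(len(t_grid), -np.log(len(t_grid)))
    return means, covs, log_w


def gmm_log_pdf(x, gmm):
    """Log density at points x of shape (N, 2), combined with log-sum-exp."""
    means, covs, log_w = gmm
    diff = x[:, None, :] - means[None, :, :]                    # (N, K, 2)
    maha = np.einsum("nki,kij,nkj->nk", diff, np.linalg.inv(covs), diff)
    logdet = np.linalg.slogdet(covs)[1]                         # (K,)
    log_comp = log_w - 0.5 * (2 * np.log(2 * np.pi) + logdet + maha)
    top = log_comp.max(axis=1, keepdims=True)                   # avoids underflow
    return top[:, 0] + np.log(np.exp(log_comp - top).sum(axis=1))


def sample_gmm(rng, n, gmm):
    """Draw n points: pick a component, then sample from its Gaussian."""
    means, covs, log_w = gmm
    w = np.exp(log_w)
    k = rng.choice(len(w), size=n, p=w / w.sum())
    chol = np.linalg.cholesky(covs)                             # (K, 2, 2)
    z = rng.standard_normal((n, 2))
    return means[k] + np.einsum("nij,nj->ni", chol[k], z)


def mc_kl(rng, n, q, p):
    """Monte Carlo estimate of KL(q || p) using n samples from q."""
    x = sample_gmm(rng, n, q)
    return float(np.mean(gmm_log_pdf(x, q) - gmm_log_pdf(x, p)))


def make_oscillator(amplitude, omega=2.0):
    """Return f(t) = A cos(omega t) and its derivative."""
    f = lambda t: amplitude * np.cos(omega * t)
    df = lambda t: -amplitude * omega * np.sin(omega * t)
    return f, df


t_grid = np.linspace(0.0, 3.0, 16)
gmm_a = gmm_from_anm(*make_oscillator(1.0), t_grid, sigma_t=0.1, sigma_n=0.05)
gmm_b = gmm_from_anm(*make_oscillator(2.0), t_grid, sigma_t=0.1, sigma_n=0.05)

rng = np.random.default_rng(0)
kl_same = mc_kl(rng, 2000, gmm_a, gmm_a)
kl_diff = mc_kl(rng, 2000, gmm_a, gmm_b)
```

Here `gmm_b` simply stands in for a second mixture with a different amplitude. In the framework summarized above, the two mixtures would be the posterior and the prior. Because every component has determinant \(\sigma_t^2\sigma_n^2\), a nonzero `sigma_n` keeps each covariance invertible.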

Interpretable Representation Learning from Videos using Nonlinear Priors

Towards an Interpretable Latent Space in Structured Models for Video Prediction

Video-Language Models as Flexible Social and Physical Reasoners

Neural Implicit Representations for Physical Parameter Inference from a Single Video

Learning to Represent Mechanics via Long-term Extrapolation and Interpolation

Learning functional priors and posteriors from data and physics

Latent Intuitive Physics: Learning to Transfer Hidden Physics from A 3D Video

From latent dynamics to meaningful representations

Physics-enhanced Gaussian Process Variational Autoencoder

Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge

Towards Principled Representation Learning from Videos for Reinforcement Learning

Learning Interpretable Dynamics from Images of a Freely Rotating 3D Rigid Body

Physics-Informed Priors with Application to Boundary Layer Velocity

Variational Encoder-Decoders for Learning Latent Representations of Physical Systems

Unsupervised Image Representation Learning with Deep Latent Particles

Learning intermediate-level representations of form and motion from natural movies

Resolution-independent generative models based on operator learning for physics-constrained Bayesian inverse problems

Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty

Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems

Learning Kinematic Formulas from Multiple View Videos

Extracting Interpretable Physical Parameters from Spatiotemporal Systems using Unsupervised Learning