Interpretable Representation Learning from Videos using Nonlinear Priors

Marian Longa, João F. Henriques
2024-10-24
Abstract: Learning interpretable representations of visual data is an important challenge, to make machines' decisions understandable to humans and to improve generalisation outside of the training distribution. To this end, we propose a deep learning framework where one can specify nonlinear priors for videos (e.g. of Newtonian physics) that allow the model to learn interpretable latent variables and use these to generate videos of hypothetical scenarios not observed at training time. We do this by extending the Variational Auto-Encoder (VAE) prior from a simple isotropic Gaussian to an arbitrary nonlinear temporal Additive Noise Model (ANM), which can describe a large number of processes (e.g. Newtonian physics). We propose a novel linearization method that constructs a Gaussian Mixture Model (GMM) approximating the prior, and derive a numerically stable Monte Carlo estimate of the KL divergence between the posterior and prior GMMs. We validate the method on different real-world physics videos including a pendulum, a mass on a spring, a falling object and a pulsar (rotating neutron star). We specify a physical prior for each experiment and show that the correct variables are learned. Once a model is trained, we intervene on it to change different physical variables (such as oscillation amplitude or adding air drag) to generate physically correct videos of hypothetical scenarios that were not observed previously.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to learn interpretable representations from videos, so that a machine's decisions are more understandable to humans and generalize better outside the training distribution. Specifically, the authors propose a deep learning framework in which specifying nonlinear priors (such as Newtonian physics) lets the model learn interpretable latent variables and use them to generate videos of unseen hypothetical scenarios.

### Problem Background

Prior work in causal reasoning and intuitive physics mainly focuses either on identifying causal relationships between variables provided by domain experts, or on designing systems that imitate the human ability to understand physical phenomena using "common sense". However, these methods often cannot perform counterfactual reasoning about unobserved scenarios. This paper aims to combine the advantages of both lines of work by introducing physical-mechanism priors, enabling the model to learn physical variables and generate counterfactual videos.

### Core Problems of the Paper

1. **How to learn interpretable representations from videos**: To make the machine's decisions more transparent and interpretable, the model needs to learn latent variables with clear physical meanings.
2. **How to generate videos of hypothetical scenarios**: After learning the physical variables, the model should be able to generate videos of previously unobserved hypothetical scenarios by intervening on these variables, such as changing the gravitational acceleration, increasing air resistance, etc.
3. **How to handle nonlinear priors**: Traditional variational auto-encoders (VAEs) usually use a simple isotropic Gaussian prior, whereas the proposed method extends the prior so that it can describe complex nonlinear processes (such as Newtonian physics).
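
For context on the isotropic Gaussian prior mentioned above: with a diagonal Gaussian posterior and a standard-normal prior, the KL term of a standard VAE has a well-known closed form. The minimal sketch below shows that baseline. The function name and the `mu` / `log_var` parameterization are illustrative assumptions, not taken from the paper. Once the prior is replaced by a mixture, this closed form no longer applies.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    Hypothetical helper: `mu` and `log_var` stand for the encoder outputs
    of a standard VAE (an assumption made for this sketch).
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    # Per-dimension term: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Posterior equal to the prior: the divergence vanishes.
print(kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Shifting one latent mean by 1 costs 0.5 nats.
print(kl_diag_gaussian_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```
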
### Solutions

To solve the above problems, the authors propose the following innovations:

1. **A new interpretable representation-learning framework based on the VAE**: The framework uses a nonlinear additive noise model (ANM) as the prior, allowing the model to learn latent variables from time-series data.
2. **A new method for approximating the prior density**: By locally linearizing the prior, it is automatically decomposed into a non-isotropic Gaussian mixture model (GMM), which makes it possible to model complex nonlinear processes.
3. **A numerically stable and highly parallel KL divergence estimate**: A numerically stable, highly parallel estimate of the KL divergence between the posterior and prior GMMs is derived for optimizing the VAE objective.
4. **Experimental verification**: Experiments on four real-world physics videos show that the method learns the correct latent variables and generates realistic counterfactual videos.

### Conclusions

By introducing physical-mechanism priors, the authors develop a method that learns physical variables from videos and generates videos of hypothetical scenarios. The method improves the interpretability of the model and can also generate realistic counterfactual videos outside the training distribution. Future work will broaden the range of applications and relax the assumption that the prior must be fully specified.

### Formula Summary

- **Formula for the change of object position over time**: \[ y(t)=A\cos(\omega t)+n \] where \(A\) is the amplitude, \(\omega\) is the angular frequency, and \(n\) is Gaussian noise.
- **Additive Noise Model (ANM)**: \[ y = f(x)+n \] where \(f(x)\) is a known physical mechanism and \(n\) is Gaussian noise.
- **Mean and covariance matrix of a Gaussian Mixture Model (GMM) component**: \[ \begin{pmatrix} t \\ y \end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix} t_0 \\ f(t_0)+n_0 \end{pmatrix},\begin{bmatrix} \sigma_t^2 & \frac{\partial f}{\partial t}(t_0)\sigma_t^2 \\ \frac{\partial f}{\partial t}(t_0)\sigma_t^2 & \left(\frac{\partial f}{\partial t}(t_0)\right)^2\sigma_t^2 \end{bmatrix}\right) \]
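
The formulas above can be tied together in a short sketch: linearize \(y = A\cos(\omega t) + n\) around several points \(t_0\) to obtain one Gaussian component each, then estimate the KL divergence between two such mixtures by sampling. This is a hedged illustration, not the paper's implementation. All function names, the values of `A`, `OMEGA`, `sigma_t` and `sigma_n`, and the equal component weights are assumptions. The added `sigma_n**2` term assumes independent Gaussian noise on \(y\) and is needed here so the covariances are invertible. The estimator is a plain Monte Carlo average with a log-sum-exp mixture density, which is not necessarily the estimator derived in the paper.

```python
import numpy as np

A, OMEGA = 1.0, 2.0  # illustrative values, not from the paper

def f(t, amp=A):
    """Noise-free mechanism y = amp * cos(OMEGA * t)."""
    return amp * np.cos(OMEGA * t)

def df_dt(t, amp=A):
    """Analytic derivative of f with respect to t."""
    return -amp * OMEGA * np.sin(OMEGA * t)

def build_gmm(centers, sigma_t, sigma_n, amp=A, n0=0.0):
    """One equally weighted Gaussian over (t, y) per linearization point t0."""
    slope = df_dt(centers, amp)
    means = np.stack([centers, f(centers, amp) + n0], axis=-1)   # (K, 2)
    covs = np.empty((len(centers), 2, 2))
    covs[:, 0, 0] = sigma_t**2
    covs[:, 0, 1] = covs[:, 1, 0] = slope * sigma_t**2
    # sigma_n**2 is an added assumption: independent noise variance on y.
    covs[:, 1, 1] = slope**2 * sigma_t**2 + sigma_n**2
    weights = np.full(len(centers), 1.0 / len(centers))
    return weights, means, covs

def gmm_log_pdf(x, weights, means, covs):
    """Log-density of a 2-D GMM at points x of shape (N, 2), via log-sum-exp."""
    diff = x[:, None, :] - means[None, :, :]                     # (N, K, 2)
    prec = np.linalg.inv(covs)                                   # (K, 2, 2)
    maha = np.einsum("nki,kij,nkj->nk", diff, prec, diff)        # (N, K)
    logdet = np.linalg.slogdet(covs)[1]                          # (K,)
    log_comp = np.log(weights) - 0.5 * (maha + logdet) - np.log(2.0 * np.pi)
    top = log_comp.max(axis=1, keepdims=True)                    # avoids underflow
    return top[:, 0] + np.log(np.exp(log_comp - top).sum(axis=1))

def gmm_sample(rng, n, weights, means, covs):
    """Draw n samples: pick a component, then transform standard normal noise."""
    idx = rng.choice(len(weights), size=n, p=weights)
    chol = np.linalg.cholesky(covs)
    eps = rng.standard_normal((n, 2))
    return means[idx] + np.einsum("nij,nj->ni", chol[idx], eps)

def mc_kl(rng, n, q, p):
    """Monte Carlo estimate of KL(q || p) using n samples drawn from q."""
    x = gmm_sample(rng, n, *q)
    return np.mean(gmm_log_pdf(x, *q) - gmm_log_pdf(x, *p))

rng = np.random.default_rng(0)
centers = np.linspace(0.0, np.pi, 16)
q = build_gmm(centers, sigma_t=0.05, sigma_n=0.02)            # amplitude 1
p = build_gmm(centers, sigma_t=0.05, sigma_n=0.02, amp=2.0)   # amplitude 2
kl_qq = mc_kl(rng, 500, q, q)
kl_qp = mc_kl(rng, 2000, q, p)
print("KL(q || q):", kl_qq)
print("KL(q || p):", kl_qp)
```

Setting `sigma_n` to zero would give rank-one covariances (all variance in \(y\) comes from \(t\)), in which case the inverse and Cholesky factor used here do not exist. That is why the sketch keeps a small positive noise variance.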