Abstract: We focus on the task of future frame prediction in videos governed by underlying physical dynamics. We work with object-centric models, i.e., models that operate explicitly on object representations and propagate a loss in the latent space. Specifically, our research builds on recent work by Kipf et al. \cite{kipf&al20}, which predicts the next state via contrastive learning of object interactions in a latent space using a Graph Neural Network. We argue that injecting an explicit inductive bias into the model, in the form of general physical laws, can not only make the model more interpretable but also improve its overall predictions. As a natural by-product, our model can learn feature maps that closely resemble actual object positions in the image, without any explicit supervision of object positions at training time. In contrast to earlier work \cite{jaques&al20}, which assumes complete knowledge of the dynamics governing the motion in the form of a physics engine, we rely only on knowledge of general physical laws, such as the fact that the world consists of objects that have positions and velocities. We propose an additional decoder-based loss in the pixel space, imposed in a curriculum manner, to further refine the latent space predictions. Experiments in multiple settings demonstrate that while the model of Kipf et al. is effective at capturing object interactions, our model can be significantly more effective at localising objects, resulting in improved performance in 3 out of the 4 domains that we experiment with. Additionally, our model can learn highly interpretable feature maps that resemble actual object positions.

DDLP: Unsupervised Object-Centric Video Prediction with Deep Dynamic Latent Particles

Unsupervised Image Representation Learning with Deep Latent Particles

Adaptive Hierarchical Motion-Focused Model for Video Prediction

Object-centric Video Prediction without Annotation

Learning Physical Dynamics for Object-centric Visual Prediction

DDP: Diffusion Model for Dense Visual Prediction

Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes

Video Probabilistic Diffusion Models in Projected Latent Space

Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction

Unsupervised Detection and Tracking of Arbitrary Objects with Dependent Dirichlet Process Mixtures

Video Interpolation and Prediction with Unsupervised Landmarks

Long-Term Prediction of Natural Video Sequences with Robust Video Predictors

Disentangling Propagation and Generation for Video Prediction

Towards an Interpretable Latent Space in Structured Models for Video Prediction

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models