The Power of Next-Frame Prediction for Learning Physical Laws

Thomas Winterbottom, G. Thomas Hudson, Daniel Kluvanec, Dean Slack, Jamie Sterling, Junjie Shentu, Chenghao Xiao, Zheming Zhou, Noura Al Moubayed
2024-05-22
Abstract: Next-frame prediction is a useful and powerful method for modelling and understanding the dynamics of video data. Inspired by the empirical success of causal language modelling and next-token prediction in language modelling, we explore the extent to which next-frame prediction serves as a strong foundational learning strategy (analogous to language modelling) for inducing an understanding of the visual world. In order to quantify the specific visual understanding induced by next-frame prediction, we introduce six diagnostic simulation video datasets derived from fundamental physical laws, created by varying physical constants such as gravity and mass. We demonstrate that our models trained only on next-frame prediction are capable of predicting the value of these physical constants (e.g. gravity) without having been trained directly to learn these constants via a regression task. We find that the generative training phase alone induces a model state that can predict physical constants significantly better than that of a random model, improving the loss by a factor of between 1.28 and 6.24. We conclude that next-frame prediction shows great promise as a general learning strategy to induce understanding of the many 'laws' that govern the visual domain without the need for explicit labelling.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper investigates whether next-frame prediction can serve as a foundational learning strategy for understanding the dynamics and physical laws present in video data. Inspired by the success of next-token prediction in language modelling, the researchers quantify the visual understanding induced by next-frame prediction using six diagnostic simulation video datasets derived from fundamental physical laws. They find that models trained solely on next-frame prediction can predict physical constants such as gravity without having been trained on these constants directly through a regression task. This suggests that self-supervised pre-training can learn these laws implicitly, and that it shows great promise for understanding the "laws" of the visual domain without explicit labels. The paper also compares two model architectures, a fully convolutional 2D CNN and a Patch Transformer, and reports that pre-training significantly improves performance on the physical-constant prediction tasks.
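The setup described above can be illustrated with a toy sketch. It is not the paper's dataset or code: the falling-ball scene, the frame size, the time step, the gravity range, and the function names `simulate_falling_ball`, `next_frame_pairs`, and `improvement_factor` are all assumptions made for illustration. The sketch renders short videos whose only varying constant is gravity, splits one video into (current frame, next frame) training pairs, and computes an improvement factor under the assumption that it is the ratio of the random model's loss to the pre-trained model's loss.

```python
import numpy as np

def simulate_falling_ball(gravity, n_frames=16, size=32, dt=0.1):
    """Render a toy video of a one-pixel ball falling from rest (hypothetical scene)."""
    frames = np.zeros((n_frames, size, size), dtype=np.float32)
    x = size // 2  # the ball stays in the middle column
    for t in range(n_frames):
        # Free fall from the top row: y = 0.5 * g * t^2, in pixel units.
        y = 0.5 * gravity * (t * dt) ** 2
        row = min(int(round(y)), size - 1)  # clamp once the ball reaches the floor
        frames[t, row, x] = 1.0
    return frames

def next_frame_pairs(video):
    """Split a video into (input frame, target frame) pairs for next-frame prediction."""
    return video[:-1], video[1:]

def improvement_factor(random_loss, pretrained_loss):
    """Assumed definition: how many times lower the pre-trained model's loss is."""
    return random_loss / pretrained_loss

# Build a tiny dataset by varying the physical constant (assumed range).
rng = np.random.default_rng(0)
gravities = rng.uniform(1.0, 20.0, size=4)
videos = np.stack([simulate_falling_ball(g) for g in gravities])

# Pairs for generative pre-training; `gravities` would only be used later
# as regression targets when probing the pre-trained model.
inputs, targets = next_frame_pairs(videos[0])
```

In this sketch the gravity values never appear in the pre-training pairs, which mirrors the abstract's point that the constants are not supervised during the generative phase.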