Shape of Motion: 4D Reconstruction from a Single Video

Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, Angjoo Kanazawa
2024-07-19
Abstract: Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches are limited in that they either depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. In this work, we introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion, from casually captured monocular videos. We tackle the under-constrained nature of the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we utilize a comprehensive set of data-driven priors, including monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes. Project Page: https://shape-of-motion.github.io/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses four-dimensional (4D) reconstruction from a single video: recovering the geometric structure and three-dimensional (3D) motion of a dynamic scene from a casually captured monocular video. The authors propose a method that reconstructs generic dynamic scenes with explicit 3D motion spanning the entire video sequence, without depending on templates and without being limited to quasi-static scenes. The method tackles the under-constrained nature of the problem with two key insights:
1. Exploiting the low-dimensional structure of 3D motion. Scene motion is represented with a compact set of SE(3) motion bases, and each point's motion is expressed as a linear combination of these bases. This yields a soft decomposition of the scene into multiple rigidly moving groups.
2. Using a range of data-driven priors, including monocular depth maps and long-range two-dimensional (2D) tracks. The method consolidates these noisy supervisory signals into a globally consistent representation of the dynamic scene.
Experiments show that the method achieves state-of-the-art performance in long-range 3D/2D motion estimation and in novel view synthesis of dynamic scenes. Specifically, the proposed dynamic scene representation enables both real-time novel view synthesis and globally consistent 3D tracking for any point at any time. In addition, an optimization framework fits this representation to complex dynamic scenes in monocular videos by using physical motion priors together with data-driven priors. The key contributions of the paper are: (1) a novel dynamic scene representation that enables both real-time novel view synthesis and globally consistent 3D tracking; (2) a carefully designed framework that optimizes the representation on monocular videos by leveraging physical motion priors and data-driven priors.
With these contributions, the method significantly outperforms previous methods in long-range 2D and 3D tracking accuracy, as well as in novel view synthesis quality.
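To make the first insight concrete, the following is a minimal sketch of how a point's motion could be computed as a weighted combination of shared rigid motion bases. It is an illustration under stated assumptions, not the authors' implementation: the function names (`rot6d_to_matrix`, `blend_bases`, `move_point`), the choice to blend rotations in a 6D parameterization (the first two columns of a rotation matrix) before re-orthonormalizing, and the toy bases and weights are all hypothetical.

```python
import numpy as np


def rot6d_to_matrix(r6):
    """Map a 6D vector (two stacked 3D columns) to a rotation matrix
    via Gram-Schmidt orthonormalization."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)              # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)


def matrix_to_rot6d(R):
    """Keep the first two columns of a rotation matrix as a 6D vector."""
    return np.concatenate([R[:, 0], R[:, 1]])


def blend_bases(weights, basis_rot6d, basis_trans):
    """Blend B rigid motion bases for one point at one time step.

    weights:     (B,) non-negative per-point coefficients (assumed)
    basis_rot6d: (B, 6) rotation of each basis in 6D form
    basis_trans: (B, 3) translation of each basis
    Returns a rotation matrix (3, 3) and a translation (3,).
    """
    w = weights / weights.sum()        # normalize so coefficients sum to 1
    r6 = w @ basis_rot6d               # linear combination of rotations (6D)
    t = w @ basis_trans                # linear combination of translations
    return rot6d_to_matrix(r6), t


def move_point(x0, weights, basis_rot6d, basis_trans):
    """Transform a canonical 3D point x0 with its blended rigid motion."""
    R, t = blend_bases(weights, basis_rot6d, basis_trans)
    return R @ x0 + t


if __name__ == "__main__":
    # Two toy bases: a pure translation and a 90-degree rotation about z.
    Rz = np.array([[0.0, -1.0, 0.0],
                   [1.0,  0.0, 0.0],
                   [0.0,  0.0, 1.0]])
    basis_rot6d = np.stack([matrix_to_rot6d(np.eye(3)), matrix_to_rot6d(Rz)])
    basis_trans = np.array([[1.0, 0.0, 0.0],
                            [0.0, 0.0, 0.0]])
    x0 = np.array([1.0, 0.0, 0.0])
    for w in ([1.0, 0.0], [0.0, 1.0], [0.5, 0.5]):
        print(w, move_point(x0, np.array(w), basis_rot6d, basis_trans))
```

In this sketch, a point whose weight concentrates on one basis moves rigidly with that basis, while mixed weights give a smooth interpolation between bases. That is the sense in which per-point coefficients give a soft grouping of the scene into rigidly moving parts.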