Abstract: Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches are limited in that they either depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. In this work, we introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion, from casually captured monocular videos. We tackle the under-constrained nature of the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we utilize a comprehensive set of data-driven priors, including monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes. Project Page: https://shape-of-motion.github.io/

What problem does this paper attempt to address?

The paper addresses four-dimensional (4D) reconstruction from a single video: recovering the geometric structure and three-dimensional (3D) motion of a dynamic scene from an ordinary monocular video. The authors propose a method that reconstructs generic dynamic scenes, with explicit 3D motion spanning the entire video sequence, without relying on templates and without being restricted to quasi-static scenes.

The method tackles the under-constrained nature of the problem with two key insights:

1. It exploits the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is a linear combination of these bases, which yields a soft decomposition of the scene into multiple rigidly moving groups.
2. It uses a range of data-driven priors, including monocular depth maps and long-range two-dimensional (2D) tracks, and consolidates these noisy supervisory signals into a globally consistent representation of the dynamic scene.

Experiments show that the method achieves state-of-the-art performance in long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.

The key contributions of the paper are: (1) a novel dynamic scene representation that enables both real-time novel view synthesis and globally consistent 3D tracking at any point in time; (2) a carefully designed framework that optimizes the representation on monocular videos of complex dynamic scenes by leveraging physical motion priors and data-driven priors. According to the paper, these contributions significantly surpass previous methods in long-range 2D and 3D tracking accuracy and in novel view synthesis quality.
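
To make the first insight concrete, the following is a minimal sketch of how a point's motion could be written as a weighted combination of a few shared rigid (SE(3)) motion bases. It is an illustration under stated assumptions, not the authors' implementation: the function names `rotation_z` and `blend_motion`, the array shapes, and the choice to blend the transformed positions (in the style of linear blend skinning) are all hypothetical and are not taken from the paper.

```python
import numpy as np


def rotation_z(angle):
    """Return a 3x3 rotation matrix about the z-axis (helper for the demo)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def blend_motion(points, weights, rotations, translations):
    """Move points with a per-point weighted combination of rigid bases.

    points:       (N, 3) positions in the reference frame
    weights:      (N, B) per-point coefficients over the B bases
                  (assumed non-negative and summing to 1 per point)
    rotations:    (B, 3, 3) rotation part of each basis at the target frame
    translations: (B, 3) translation part of each basis at the target frame
    returns:      (N, 3) positions at the target frame
    """
    # Apply every basis transform to every point: result has shape (B, N, 3).
    moved = np.einsum("bij,nj->bni", rotations, points) + translations[:, None, :]
    # Blend the B candidate positions of each point with its own weights.
    return np.einsum("nb,bni->ni", weights, moved)


# Two hypothetical bases: "stay still" and "rotate 90 degrees about z, lift by 1".
rotations = np.stack([np.eye(3), rotation_z(np.pi / 2)])
translations = np.array([[0.0, 0.0, 0.0],
                         [0.0, 0.0, 1.0]])

points = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])
# Point 0 follows basis 0, point 1 follows basis 1, point 2 is a soft mix.
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.5, 0.5]])

moved_points = blend_motion(points, weights, rotations, translations)
print(moved_points)
```

In this toy setup, points whose weights concentrate on one basis move rigidly with it, while intermediate weights give a soft assignment between groups. Because only the small set of bases changes per frame and each point carries just a weight vector, the number of motion parameters stays low, which is the low-dimensional structure the summary refers to.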

Shape of Motion: 4D Reconstruction from a Single Video

In-Hand 3D Object Reconstruction from a Monocular RGB Video

Towards robust 3d reconstruction of human motion from monocular video

DRSM: efficient neural 4d decomposition for dynamic reconstruction in stationary monocular cameras

Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image

Marker-Less 3d Human Motion Capture With Monocular Image Sequence And Height-Maps

Feature-Assisted Dense Spatio-Temporal Reconstruction From Binocular Sequences

3d Reconstruction Of Dynamic Scenes With Multiple Handheld Cameras

DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis

Temporally Coherent General Dynamic Scene Reconstruction

Temporally Coherent 4D Reconstruction of Complex Dynamic Scenes

R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras

MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds

Video Motion Capture by Silhouette Analysis and Pose Optimization

Three-Dimensional Motion Estimation Via Matrix Completion

Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses

GFlow: Recovering 4D World from Monocular Video

Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth

Real-time Indoor Scene Reconstruction with RGBD and Inertial Input.