MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds

Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, Kostas Daniilidis
2024-11-30
Abstract: We introduce 4D Motion Scaffolds (MoSca), a modern 4D reconstruction system designed to reconstruct and synthesize novel views of dynamic scenes from monocular videos captured casually in the wild. To address such a challenging and ill-posed inverse problem, we leverage prior knowledge from foundational vision models and lift the video data to a novel Motion Scaffold (MoSca) representation, which compactly and smoothly encodes the underlying motions/deformations. The scene geometry and appearance are then disentangled from the deformation field and are encoded by globally fusing the Gaussians anchored onto the MoSca and optimized via Gaussian Splatting. Additionally, camera focal length and poses can be solved using bundle adjustment without the need of any other pose estimation tools. Experiments demonstrate state-of-the-art performance on dynamic rendering benchmarks and its effectiveness on real videos.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
This paper addresses the challenging problem of reconstructing dynamic scenes, and synthesizing novel views of them, from monocular videos. Specifically, it proposes a 4D reconstruction system named **4D Motion Scaffolds (MoSca)** that reconstructs and renders dynamic scenes from monocular videos with unknown camera parameters.

#### Main problems and challenges

1. **Limitations of the input data**:
   - Monocular videos usually lack multi-view stereo cues, which makes robust 4D scene reconstruction from such input very difficult.
2. **Ill-posed nature of the inverse problem**:
   - Because a monocular video provides limited information, the reconstruction task is inherently ill-posed: many different solutions are consistent with the same input. Prior knowledge is therefore needed to constrain the solution space.
3. **Modeling of complex dynamic scenes**:
   - Object motion and deformation in dynamic scenes are usually complex and involve occlusion and non-rigid deformation, which makes reconstruction and rendering more demanding.

#### Overview of solutions

To address these challenges, the paper proposes the following key techniques:

1. **Using pre-trained 2D visual foundation models**:
   - Large-scale pre-trained 2D visual foundation models (for example, depth estimation and 2D pixel tracking) provide initial cues for geometry and correspondence.
2. **Introducing the Motion Scaffold (MoSca) representation**:
   - MoSca is a compact, smooth deformation representation. It encodes low-rank, smooth motion through sparse graph nodes and can be optimized with physically inspired regularization.
3. **Globally fused Gaussians**:
   - Gaussians anchored onto the MoSca fuse the observations from all time steps into one complete dynamic scene reconstruction and are optimized via Gaussian Splatting.
4. **Camera pose estimation and optimization**:
   - The camera focal length and poses are estimated through bundle adjustment and photometric optimization, without any other pose estimation tools.

#### Summary

The main contribution of the paper is MoSca, a fully automated 4D reconstruction system that handles real-world monocular videos with unknown camera poses and addresses several key problems in dynamic scene reconstruction. By combining powerful 2D visual foundation models with a structured deformation representation, MoSca achieves state-of-the-art performance on dynamic scene rendering benchmarks.
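
To make the idea of a sparse-node deformation representation more concrete, here is a minimal sketch, not the paper's implementation. It assumes translation-only node trajectories, Gaussian RBF weights over the nearest nodes, and an edge-length-preservation penalty as a stand-in for "physically inspired regularization". The names `rbf_weights`, `warp_point`, and `edge_length_penalty`, and the parameters `radius` and `k`, are hypothetical and chosen only for illustration.

```python
import math


def rbf_weights(point, nodes, radius, k):
    """Normalized Gaussian RBF weights over the k nodes nearest to `point`."""
    nearest = sorted((math.dist(point, n), i) for i, n in enumerate(nodes))[:k]
    raw = [(i, math.exp(-(d * d) / (2.0 * radius * radius))) for d, i in nearest]
    total = sum(w for _, w in raw)
    return [(i, w / total) for i, w in raw]


def warp_point(point, traj, t_src, t_dst, radius=0.5, k=2):
    """Move `point` from frame t_src to t_dst by blending node displacements.

    `traj[t][i]` is the position of node i at frame t (assumed layout).
    """
    weights = rbf_weights(point, traj[t_src], radius, k)
    out = list(point)
    for i, w in weights:
        for axis in range(len(point)):
            out[axis] += w * (traj[t_dst][i][axis] - traj[t_src][i][axis])
    return tuple(out)


def edge_length_penalty(traj, edges):
    """Sum of squared edge-length changes between consecutive frames."""
    loss = 0.0
    for t in range(len(traj) - 1):
        for i, j in edges:
            d0 = math.dist(traj[t][i], traj[t][j])
            d1 = math.dist(traj[t + 1][i], traj[t + 1][j])
            loss += (d1 - d0) ** 2
    return loss


# Two nodes over two frames; both translate by +1 along x (a rigid motion).
traj = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
]
moved = warp_point((0.5, 0.0, 0.0), traj, t_src=0, t_dst=1)
penalty = edge_length_penalty(traj, edges=[(0, 1)])
```

In this toy case the point midway between the two nodes follows their shared translation, and the penalty is zero because the edge keeps its length. A handful of nodes drives the motion of every point near them, which is what makes such a representation compact and smooth.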