Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

Colton Stearns, Adam Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, Leonidas Guibas
DOI: https://doi.org/10.1145/3680528.3687681
2024-09-11
Abstract: Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional editability. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet, these methods assume dense multi-view videos as supervision. In this work, we are interested in extending the capability of Gaussian scene representations to casually captured monocular videos. We show that existing 4D Gaussian methods dramatically fail in this setup because the monocular setting is underconstrained. Building off this finding, we propose a method we call Dynamic Gaussian Marbles, which consists of three core modifications that target the difficulties of the monocular setting. First, we use isotropic Gaussian "marbles", reducing the degrees of freedom of each Gaussian. Second, we employ a hierarchical divide-and-conquer learning strategy to efficiently guide the optimization towards solutions with globally coherent motion. Finally, we add image-level and geometry-level priors into the optimization, including a tracking loss that takes advantage of recent progress in point tracking. By constraining the optimization, Dynamic Gaussian Marbles learns Gaussian trajectories that enable novel-view rendering and accurately capture the 3D motion of the scene elements. We evaluate on the Nvidia Dynamic Scenes dataset and the DyCheck iPhone dataset, and show that Gaussian Marbles significantly outperforms other Gaussian baselines in quality, and is on par with non-Gaussian representations, all while maintaining the efficiency, compositionality, editability, and tracking benefits of Gaussians.
Our project page can be found at https://geometry.stanford.edu/projects/dynamic-gaussian-marbles.github.io/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses novel view synthesis of dynamic scenes from casually captured monocular videos. Specifically, it focuses on how to extract 3D geometry, motion, and radiance from a dynamic-scene video captured by a single camera, and then render the scene from new viewpoints. The core challenge is recovering 3D information from a single view, which is much harder than in multi-view settings, where the additional views provide more constraints for reconstruction. The paper points out that although existing 4D Gaussian methods perform well on multi-view videos, they fail on monocular videos, mainly because the monocular setting is underconstrained.

To overcome this, the authors propose a method named "Dynamic Gaussian Marbles", which adapts to monocular videos through three core modifications:

1. **Using isotropic Gaussian "marbles"**: Reduce the degrees of freedom of each Gaussian, so that the optimization focuses on motion and appearance rather than local shape.
2. **Divide-and-conquer learning strategy**: Adopt a hierarchical learning procedure that gradually guides the optimization towards globally coherent motion.
3. **Adding image-level and geometry-level priors**: Introduce a tracking loss and other priors into the optimization to improve the robustness and accuracy of the model.

With these modifications, Dynamic Gaussian Marbles achieves high-quality novel view synthesis from monocular videos while retaining the efficient rendering, tracking ability, and editability of Gaussian representations.
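
As a rough illustration of the first two modifications, the sketch below contrasts the shape parameters of an isotropic "marble" (one radius) with the anisotropic covariance used in standard 3D Gaussian splatting (three scales plus a rotation), and builds a simple bottom-up grouping of frames. The function names and the pairwise merge of adjacent frame groups are assumptions made for illustration only; the text above does not specify how the paper's hierarchy is actually scheduled.

```python
import numpy as np


def isotropic_covariance(radius: float) -> np.ndarray:
    """Covariance of an isotropic 3D Gaussian: a single scalar sets its shape."""
    return (radius ** 2) * np.eye(3)


def anisotropic_covariance(scales, quat_wxyz) -> np.ndarray:
    """Covariance R diag(s^2) R^T built from 3 scales and a unit quaternion."""
    s = np.asarray(scales, dtype=float)
    q = np.asarray(quat_wxyz, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)  # normalize so R is a valid rotation
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])
    return R @ np.diag(s ** 2) @ R.T


def merge_schedule(num_frames: int):
    """Hypothetical schedule: merge adjacent frame groups pairwise, level by
    level, until one group spans the whole video."""
    groups = [[i] for i in range(num_frames)]
    levels = [groups]
    while len(groups) > 1:
        groups = [sum(groups[i:i + 2], []) for i in range(0, len(groups), 2)]
        levels.append(groups)
    return levels


if __name__ == "__main__":
    print(isotropic_covariance(0.5))
    for level in merge_schedule(5):
        print(level)
```

When all three scales are equal, the rotation has no effect on the covariance, which is why an isotropic Gaussian needs no orientation parameters: its shape is described by one number instead of six degrees of freedom.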