Lily Goli, Sara Sabour, Mark Matthews, Marcus Brubaker, Dmitry Lagun, Alec Jacobson, David J. Fleet, Saurabh Saxena, Andrea Tagliasacchi
Abstract: There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually-captured video. While these tasks rely heavily on known camera poses, the problem of finding such poses using structure-from-motion (SfM) often depends on robustly separating static from dynamic parts of a video. The lack of a robust solution to this problem limits the performance of SfM camera-calibration pipelines. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
What problem does this paper attempt to address?
This paper addresses how to make Structure from Motion (SfM) reliable on monocular videos that contain moving objects, using robust motion segmentation. SfM methods usually assume a static scene; dynamic objects violate that assumption and degrade camera pose estimation. The paper therefore proposes RoMo, a method that separates the static and dynamic parts of a video so that SfM can estimate camera poses from the static parts only.
### Problem Background
1. **Limitations of SfM methods**: Existing SfM methods (such as COLMAP) perform poorly when processing videos containing dynamic objects because these methods rely on the assumption of a static scene. The presence of dynamic objects can interfere with camera pose estimation, resulting in reconstruction failure or inaccuracy.
2. **Challenges in motion segmentation**: The goal of the motion segmentation task is to separate moving objects from videos, which is crucial for many downstream tasks (such as augmented reality, autonomous driving, action recognition, and 4D scene reconstruction). However, compared to image and video segmentation, there is relatively little research on motion segmentation, and existing methods have deficiencies:
- **Supervised methods**: They rely on synthetic data for training and lack the support of real-world labeled data.
- **Unsupervised methods**: Although they do not require labeled data, their performance is not as good as that of supervised methods, and they fail to fully utilize 3D geometric constraints.
### Proposed Solution
To solve the above problems, this paper proposes a simple and effective iterative method, RoMo, which combines optical flow and epipolar geometry and uses a pre-trained video segmentation model to generate high-quality motion segmentation masks. The specific steps are as follows:
1. **Weakly-supervised epipolar geometry**: By calculating the optical flow field between adjacent frames and using epipolar geometry to identify pixels consistent with camera motion, the static and dynamic regions are initially distinguished.
2. **Feature-based classifier**: Use a pre-trained video segmentation model to extract features, and combine the sparse labels generated in the previous step to train a lightweight classifier to generate higher-quality segmentation masks.
3. **Iterative optimization**: Through multiple iterations, gradually improve the quality of camera pose estimation and motion segmentation masks.
4. **Final refinement**: Use the SAMv2 model to further refine the segmentation masks and improve the resolution.
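
The epipolar test in step 1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the use of the Sampson distance as the epipolar residual, the two thresholds, and the synthetic scene are all assumptions made for illustration. The fundamental matrix is built here from a known camera pose, whereas a real pipeline would have to estimate it robustly from the optical flow itself.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_pose(K, R, t):
    """F = K^-T [t]_x R K^-1 for the relative pose X2 = R @ X1 + t."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv

def sampson_distance(F, x1, x2):
    """Squared Sampson distance for (N, 2) pixel correspondences x1 -> x2."""
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])           # homogeneous points in frame 1
    p2 = np.hstack([x2, ones])           # homogeneous points in frame 2
    Fp1 = p1 @ F.T                       # each row is F @ p1
    Ftp2 = p2 @ F                        # each row is F.T @ p2
    num = np.sum(p2 * Fp1, axis=1) ** 2  # (p2^T F p1)^2
    den = Fp1[:, 0] ** 2 + Fp1[:, 1] ** 2 + Ftp2[:, 0] ** 2 + Ftp2[:, 1] ** 2
    return num / np.maximum(den, 1e-12)

def weak_labels(F, x1, flow, static_thresh, dynamic_thresh):
    """0 = likely static, 1 = likely dynamic, -1 = ambiguous (left unlabeled)."""
    d = sampson_distance(F, x1, x1 + flow)
    labels = np.full(d.shape, -1)
    labels[d < static_thresh] = 0
    labels[d > dynamic_thresh] = 1
    return labels

def project(K, X):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    x = (K @ X.T).T
    return x[:, :2] / x[:, 2:3]

# Synthetic scene (hypothetical values): the camera translates along x,
# three points are static, and the last point also moves along y.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.2, 0.0, 0.0])
F = fundamental_from_pose(K, R, t)

X1 = np.array([[-1.0, 0.5, 4.0], [1.0, -0.5, 6.0], [0.3, 0.2, 5.0], [0.0, 0.0, 5.0]])
motion = np.zeros_like(X1)
motion[3] = [0.0, 0.5, 0.0]              # independent object motion
X2 = (X1 + motion) @ R.T + t

x1 = project(K, X1)
x2 = project(K, X2)
flow = x2 - x1                           # stands in for estimated optical flow
labels = weak_labels(F, x1, flow, static_thresh=1.0, dynamic_thresh=4.0)
print(labels)
```

Leaving a band of ambiguous pixels unlabeled between the two thresholds is one way to obtain the kind of sparse labels that step 2 then uses to train a classifier on segmentation features.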
### Experimental Results
Experiments show that RoMo significantly outperforms existing unsupervised methods in the motion segmentation task on multiple benchmark datasets (such as DAVIS2016, SegTrackV2, and FBMS59), and also performs well in the SfM task of handling dynamic scenes. In particular, on the MPI Sintel dataset, RoMo significantly improves the accuracy of camera trajectory estimation.
### Summary
By combining optical flow, epipolar geometry, and a pre-trained segmentation model, RoMo addresses the performance bottleneck that dynamic content imposes on SfM, and it offers a practical tool for camera estimation on complex dynamic videos.