Moving Object Segmentation: All You Need Is SAM (and Flow)

Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman
2024-04-19
Abstract: The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much studied area with numerous careful, and sometimes complex, approaches and training schemes including: self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determine if the Segment Anything model (SAM) can contribute to this task. We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects. In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt. These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identity. Again, this simple model outperforms previous methods on multiple video object segmentation benchmarks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses motion segmentation: discovering and segmenting the moving objects in a video. This is a widely researched field with many carefully designed and sometimes complex methods and training schemes, including self-supervised learning, learning from synthetic datasets, object-centric representations, and amodal representations. The paper asks whether the Segment Anything Model (SAM) can contribute to this task. To answer this, the authors propose two models that combine SAM with optical flow, pairing SAM's segmentation capability with the ability of flow to discover and group moving objects. The first model, FlowI-SAM, adapts SAM to take optical flow as input in place of RGB images. The second model, FlowP-SAM, takes RGB images as input and uses optical flow as a segmentation prompt. Without further modification, both methods outperform all previous approaches by a considerable margin on single-object and multi-object benchmarks. The authors also extend these frame-level segmentations to sequence-level segmentations that maintain object identity across frames, and this simple extension again outperforms previous methods on multiple video object segmentation benchmarks.
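To make the idea of "optical flow as input" more concrete: a dense flow field is commonly rendered as a three-channel image before it is handed to an image model. The sketch below shows one such rendering under stated assumptions. The function name `flow_to_rgb`, the HSV mapping (direction as hue, speed as saturation), and the per-frame normalisation are illustrative choices, not details taken from the paper, and nothing here calls SAM itself.

```python
import numpy as np


def flow_to_rgb(flow, max_magnitude=None):
    """Render a dense flow field of shape (H, W, 2) as a uint8 RGB image.

    Hypothetical helper for illustration: direction maps to hue and
    speed maps to saturation, so static pixels come out white.
    """
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx)  # in [-pi, pi]

    # Assumption: normalise by the largest displacement in this frame.
    if max_magnitude is None:
        max_magnitude = magnitude.max()
    saturation = np.clip(magnitude / max(max_magnitude, 1e-8), 0.0, 1.0)
    hue = (angle + np.pi) / (2.0 * np.pi)  # in [0, 1]

    # Vectorised HSV -> RGB conversion with value fixed at 1.
    def channel(n):
        k = (n + hue * 6.0) % 6.0
        weight = np.clip(np.minimum(k, 4.0 - k), 0.0, 1.0)
        return 1.0 - saturation * weight

    rgb = np.stack([channel(5), channel(3), channel(1)], axis=-1)
    return np.round(rgb * 255.0).astype(np.uint8)


# Toy example: a single pixel moving to the right in a static scene.
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[1, 1] = (1.0, 0.0)
image = flow_to_rgb(flow)
```

The resulting `image` has the same spatial size as the flow field and can be treated like any other RGB frame. In the prompt-based variant described above, the RGB frame would remain the image input, and the flow would instead be used to guide which regions are segmented.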