Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track

Feiyu Pan, Hao Fang, Runmin Cong, Wei Zhang, Xiankai Lu
2024-08-24
Abstract: The Video Object Segmentation (VOS) task aims to segment a particular object instance throughout an entire video sequence given only the object mask of the first frame. Recently, Segment Anything Model 2 (SAM 2) was proposed as a foundation model for promptable visual segmentation in images and videos. SAM 2 builds a data engine, which improves the model and data via user interaction, to collect the largest video segmentation dataset to date. SAM 2 is a simple transformer architecture with streaming memory for real-time video processing; trained on this data, it provides strong performance across a wide range of tasks. In this work, we evaluate the zero-shot performance of SAM 2 on the more challenging VOS datasets MOSE and LVOS. Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th in the 6th LSVOS Challenge VOS Track.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the Video Object Segmentation (VOS) task, with a focus on object segmentation in complex scenes. Specifically, it studies how to accurately segment a specific object instance throughout a video sequence given only the object mask in the first frame. The task matters in applications such as robotics, video editing, and reducing data annotation cost.

### Main Contributions:

1. **Evaluating Zero-Shot Performance**: The paper evaluates the zero-shot performance of Segment Anything Model 2 (SAM 2) on the more challenging VOS datasets MOSE and LVOS, without fine-tuning.
2. **Performance Comparison**: SAM 2 is compared with the Cutie model and shows superior performance in the zero-shot setting.
3. **Experimental Results**: In the 6th LSVOS Challenge VOS Track, SAM 2 achieved a J&F score of 75.79 on the test set and ranked fourth.

### Technical Methods:

- **Model Architecture**: SAM 2 uses a simple Transformer architecture with a streaming memory mechanism that enables real-time video processing.
- **Data Engine**: The model and the data are improved through user interaction, which was used to collect the largest video segmentation dataset to date.
- **Memory Mechanism**: SAM 2 conditions the embedding of the current frame on memories of past predictions and prompted frames, which improves segmentation accuracy.
- **Prompt Types**: It supports point, box, and mask prompts to define the spatial extent of an object in the video.

### Experimental Setup:

- **Datasets**: Two datasets, MOSE and LVOS, are used. They contain complex scenes, including disappearing and reappearing objects, inconspicuous small objects, severe occlusions, crowded environments, and long videos.
- **Evaluation Metrics**: Region similarity (J, average IoU), contour accuracy (F, average boundary similarity), and their average (J&F) are used as evaluation metrics.

### Conclusion:

The paper demonstrates the strong zero-shot performance of SAM 2 and provides a reference point for future VOS applications. The experimental results show that SAM 2 achieves competitive performance on complex VOS tasks without fine-tuning.
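
To make the evaluation metrics listed above concrete, the following is a minimal Python sketch of J, F, and J&F for a single pair of binary masks. It is not the challenge's official evaluation code. The function names, the 4-neighbour boundary extraction, the pixel tolerance `tol`, and the handling of empty masks are all assumptions made for illustration; official toolkits use more careful boundary matching and average the scores over frames and objects.

```python
import numpy as np


def region_similarity(pred, gt):
    """J: intersection-over-union (IoU) of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, gt).sum() / union)


def _erode(mask):
    """4-neighbour erosion; pixels outside the image count as background."""
    p = np.pad(mask, 1, constant_values=False)
    return (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
            & p[1:-1, :-2] & p[1:-1, 2:])


def _dilate(mask, steps):
    """4-neighbour dilation repeated `steps` times."""
    out = mask.copy()
    for _ in range(steps):
        p = np.pad(out, 1, constant_values=False)
        out = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
               | p[1:-1, :-2] | p[1:-1, 2:])
    return out


def contour_accuracy(pred, gt, tol=1):
    """Simplified F: F-measure between mask boundaries with a pixel tolerance."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    pred_b = pred & ~_erode(pred)  # boundary = mask minus its erosion
    gt_b = gt & ~_erode(gt)
    n_pred, n_gt = pred_b.sum(), gt_b.sum()
    if n_pred == 0 and n_gt == 0:
        return 1.0
    if n_pred == 0 or n_gt == 0:
        return 0.0
    # A boundary pixel counts as matched if it lies within `tol` of the other boundary.
    precision = (pred_b & _dilate(gt_b, tol)).sum() / n_pred
    recall = (gt_b & _dilate(pred_b, tol)).sum() / n_gt
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


def j_and_f(pred, gt, tol=1):
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return 0.5 * (region_similarity(pred, gt) + contour_accuracy(pred, gt, tol))
```

In this sketch a benchmark score would be obtained by calling `j_and_f` on every annotated frame of every object and averaging the results.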