Abstract: The Video Object Segmentation (VOS) task aims to segment a particular object instance throughout an entire video sequence, given only the object mask in the first frame. Recently, Segment Anything Model 2 (SAM 2) was proposed as a foundation model for promptable visual segmentation in images and videos. SAM 2 builds a data engine, which improves the model and data via user interaction, to collect the largest video segmentation dataset to date. SAM 2 is a simple transformer architecture with streaming memory for real-time video processing; trained on this data, it provides strong performance across a wide range of tasks. In this work, we evaluate the zero-shot performance of SAM 2 on the more challenging VOS datasets MOSE and LVOS. Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th in the 6th LSVOS Challenge VOS Track.

What problem does this paper attempt to address?

This paper addresses the Video Object Segmentation (VOS) task, with an emphasis on object segmentation in complex scenes. Specifically, it focuses on how to accurately segment a specific object instance throughout a video sequence, given the object mask in the first frame. The task matters in application areas such as robotics, video editing, and reducing data annotation costs.

### Main Contributions:

1. **Evaluating Zero-Shot Performance**: The paper evaluates the zero-shot performance of Segment Anything Model 2 (SAM 2) on the more challenging VOS datasets MOSE and LVOS, without fine-tuning.
2. **Performance Comparison**: SAM 2 is compared with the Cutie model and shows superior performance in the zero-shot setting.
3. **Experimental Results**: In the 6th LSVOS Challenge VOS Track, SAM 2 achieved a J&F score of 75.79 on the test set and ranked fourth.

### Technical Methods:

- **Model Architecture**: SAM 2 uses a simple Transformer architecture with a streaming memory mechanism that enables real-time video processing.
- **Data Engine**: The model and data are improved through user interaction, which was used to collect the largest video segmentation dataset to date.
- **Memory Mechanism**: SAM 2 conditions the embedding of the current frame on memories of past predictions and prompted frames, which improves segmentation accuracy.
- **Multiple Prompt Types**: It supports point, box, and mask prompts to define the spatial extent of an object in the video.

### Experimental Setup:

- **Datasets**: Two datasets, MOSE and LVOS, are used. They contain complex scenes with disappearing and reappearing objects, inconspicuous small objects, severe occlusions, crowded environments, and long videos.
- **Evaluation Metrics**: Region similarity (J, the average IoU), contour accuracy (F, the average boundary similarity), and their mean (J&F) are used as evaluation metrics.

### Conclusion:

The paper demonstrates the strong zero-shot performance of SAM 2 and provides a reference point for future VOS applications. The experimental results show that SAM 2 can still perform well on complex VOS tasks without fine-tuning.
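The evaluation metrics named under Experimental Setup can be made concrete with a small sketch. The code below is a simplified illustration, not the challenge's official evaluation code: the function names (`region_similarity`, `contour_accuracy`, `j_and_f`) and the `tol` parameter are hypothetical, and the boundary measure uses a plain pixel-tolerance match that only approximates the boundary matching used in standard VOS benchmarks.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection-over-union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

def _dilate(mask):
    """One step of 4-connected binary dilation using array shifts."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def _boundary(mask):
    """Foreground pixels that touch background (4-connectivity)."""
    mask = mask.astype(bool)
    eroded = ~_dilate(~mask)  # erosion as the complement of a dilated complement
    return mask & ~eroded

def contour_accuracy(pred, gt, tol=1):
    """F: simplified boundary F-measure with a pixel tolerance `tol` (assumption)."""
    pb, gb = _boundary(pred), _boundary(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    pb_d, gb_d = pb, gb
    for _ in range(tol):
        pb_d, gb_d = _dilate(pb_d), _dilate(gb_d)
    precision = (pb & gb_d).sum() / pb.sum() if pb.sum() else 0.0
    recall = (gb & pb_d).sum() / gb.sum() if gb.sum() else 0.0
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

def j_and_f(pred_masks, gt_masks):
    """Average J and F over frames, then return (J, F, J&F)."""
    js = [region_similarity(p, g) for p, g in zip(pred_masks, gt_masks)]
    fs = [contour_accuracy(p, g) for p, g in zip(pred_masks, gt_masks)]
    j, f = float(np.mean(js)), float(np.mean(fs))
    return j, f, (j + f) / 2

# Usage: one ground-truth square, one perfect and one half-covering prediction.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
half = np.zeros((8, 8), dtype=bool)
half[2:6, 2:4] = True
print(j_and_f([gt, half], [gt, gt]))
```

Reported scores such as 75.79 J&F are this kind of mean expressed as a percentage, averaged over all objects and videos in the test set rather than over two toy frames.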

Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track
