MaskTrack: Auto-Labeling and Stable Tracking for Video Object Segmentation

Zhenyu Chen, Lu Zhang, Ping Hu, Huchuan Lu, You He
DOI: https://doi.org/10.1109/TNNLS.2024.3469959
2024-10-21
Abstract: Video object segmentation (VOS) has progressed notably thanks to the establishment of video training datasets and the introduction of diverse, innovative network architectures. However, video mask annotation is intricate and labor-intensive, because meticulous frame-by-frame comparison is needed to ascertain the positions and identities of targets in subsequent frames. To save costs, current VOS benchmarks often annotate only a few instances in each video, which hinders a model's understanding of the complete context of the video scene. To simplify video annotation and achieve efficient dense labeling, we introduce a zero-shot auto-labeling strategy based on the segment anything model (SAM) that densely annotates video instances without access to any manual annotations. Moreover, although the performance of existing VOS methods keeps improving, segmenting long-term and complex video scenes remains challenging because instance identities are difficult to discriminate and track stably. To this end, we further introduce a new framework, MaskTrack, which excels in long-term VOS and also shows significant performance advantages in distinguishing instances in complex videos with densely packed similar objects. Extensive experiments demonstrate the effectiveness of the proposed method: without introducing image datasets for pretraining, it achieves excellent performance on both short-term (86.2% on YouTube-VOS val) and long-term (68.2% on LVOS val) VOS benchmarks. Surprisingly, our method also demonstrates strong generalization ability, performing well in the visual object tracking (VOT) (65.6% on VOTS2023) and referring VOS (RVOS) (65.2% on Ref YouTube VOS) challenges.