Abstract: Current benchmarks for video segmentation annotate only salient objects (i.e., foreground instances). Despite their impressive architectural designs, previous works trained on these benchmarks have struggled to adapt to real-world scenarios. A new video segmentation dataset aimed at tracking segmentation targets of multiple granularities in a video scene is therefore necessary. In this work, we aim to generate a multi-granularity video segmentation dataset that is annotated with both salient and non-salient masks. To achieve this, we propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset that includes various types and granularities of mask annotations. We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation. In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset, which achieves the best performance among existing video object segmentation methods and Segment Anything Model (SAM)-based video segmentation methods. The project page is available at https://cvlab-kaist.github.io/MUG-VOS.

What problem does this paper attempt to address?

This paper addresses the limitations of existing video segmentation datasets and models on multi-granularity segmentation tasks. Current video segmentation benchmarks mainly annotate salient target objects (i.e., foreground instances) and ignore non-salient objects, partial objects, and background elements. Models trained on them therefore adapt poorly to complex real-world scenarios, especially those with diverse object classes and backgrounds.

To overcome these problems, the authors propose a large-scale, densely annotated multi-granularity video object segmentation dataset named **MUG-VOS**. MUG-VOS contains multiple types of mask annotations, covering objects of different granularities, partial objects, stuff, and background, including content not covered in existing datasets. The authors also propose a Memory-based Mask Propagation Model (MMPM), which is trained and evaluated on the MUG-VOS dataset and shows the best performance among existing video object segmentation methods.

### Main contributions

1. **Constructing the MUG-VOS dataset**: A large-scale multi-granularity video object segmentation dataset containing 77,994 video clips and 47 million masks, which supports the tracking of both salient and non-salient objects.
2. **Proposing the MMPM model**: The model combines sequential memory and temporal memory modules, and can consistently track objects and generate high-quality segmentation masks across video sequences, especially in challenging scenarios such as occlusion, motion blur, and deformation.
3. **Improving the generalization ability of video segmentation tasks**: By introducing multi-granularity annotations, the MUG-VOS dataset helps develop models that can handle a wider range of object classes, improving the robustness and flexibility of video segmentation in practical applications.

### Formula summary

- **IoU calculation**:
  \[ \text{IoU}(\tilde{M}_{t-1}^i, C_t^i) = \frac{\text{Area of overlap}}{\text{Area of union}} \]
  where \(\tilde{M}_{t-1}^i\) is the mask warped from frame \(t-1\) to frame \(t\), and \(C_t^i\) is the candidate mask.
- **Mask density calculation**:
  \[ D = \frac{\sum_{i=1}^{H}\sum_{j=1}^{W} M_{i,j}}{H \times W} \]
  where \(H\) is the height of the image, \(W\) is the width of the image, and \(M_{i,j}\) indicates whether pixel \((i, j)\) is covered by any mask (1 if covered, 0 if not).

With the dataset and the model, the paper opens a new research direction in video segmentation: tracking targets at multiple granularities rather than only salient foreground objects.
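The IoU and mask density formulas in the summary can be illustrated with a short sketch. It is not the authors' implementation: it assumes masks are plain binary grids (lists of rows of 0/1), and the names `mask_iou` and `mask_density`, as well as the choice to return 0 for two empty masks, are hypothetical conventions chosen here for illustration.

```python
from typing import List

# Assumed representation: a binary mask is a list of rows, 1 = covered pixel.
Mask = List[List[int]]


def mask_iou(warped: Mask, candidate: Mask) -> float:
    """IoU between a warped previous-frame mask and a candidate mask."""
    overlap = 0  # pixels set in both masks
    union = 0    # pixels set in at least one mask
    for row_w, row_c in zip(warped, candidate):
        for w, c in zip(row_w, row_c):
            if w and c:
                overlap += 1
            if w or c:
                union += 1
    # Assumed convention: two empty masks give IoU 0 instead of dividing by zero.
    return overlap / union if union else 0.0


def mask_density(masks: List[Mask]) -> float:
    """Fraction of pixels covered by at least one mask (D in the summary).

    Expects a non-empty list of masks that all share the same H x W shape.
    """
    height = len(masks[0])
    width = len(masks[0][0])
    covered = 0
    for i in range(height):
        for j in range(width):
            # M_{i,j} = 1 if any mask covers pixel (i, j)
            if any(m[i][j] for m in masks):
                covered += 1
    return covered / (height * width)


if __name__ == "__main__":
    warped = [[1, 1, 0],
              [0, 0, 0]]
    candidate = [[0, 1, 1],
                 [0, 0, 0]]
    print(mask_iou(warped, candidate))        # overlap 1, union 3
    print(mask_density([warped, candidate]))  # 3 of 6 pixels covered
```

In this toy example the two masks share one pixel out of three covered pixels, so the IoU is 1/3, and three of the six pixels are covered by some mask, so the density is 0.5. For real frames an array library would be the more practical choice; nested lists are used here only to keep the sketch dependency-free.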

Multi-Granularity Video Object Segmentation
