Abstract: Video segmentation aims to partition video sequences into meaningful segments based on objects or regions of interest within frames. Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets. This leads to inconsistent segmentation results across frames. To address these issues, we propose Masked Video Consistency (MVC), a training strategy that enhances spatial and temporal feature aggregation. MVC randomly masks image patches, compelling the network to predict the entire semantic segmentation and thus improving contextual information integration. Additionally, we introduce Object Masked Attention (OMA) to optimize the cross-attention mechanism by reducing the impact of irrelevant queries, thereby enhancing temporal modeling capabilities. Our approach, integrated into the latest decoupled universal video segmentation framework, achieves state-of-the-art performance across five datasets for three video segmentation tasks, demonstrating significant improvements over previous methods without increasing model parameters.

What problem does this paper attempt to address?

This paper attempts to solve several key problems in video segmentation:

1. **Difficulty understanding object shapes, textures and contours on small-scale or class-imbalanced datasets**: Existing video segmentation models are usually built on image segmentation techniques. On small-scale or class-imbalanced datasets they struggle to understand and segment objects, and often split a single object into multiple regions.
2. **Lack of temporal consistency in segmentation results**: Current methods produce inconsistent results along the temporal dimension, so the segmentation of the same object is unstable between adjacent frames.
3. **Slow convergence due to the large number of queries in query-based models**: Using the cross-attention mechanism to locate the target segmentation area in the spatial and temporal dimensions requires many training iterations, which makes these models inefficient in practical applications.

To address these challenges, the authors propose the following solutions:

- **Masked Video Consistency (MVC)**: Image patches are randomly masked and the network is trained to predict the semantic segmentation of the entire image, which strengthens its ability to aggregate context information across space and time. MVC introduces two different masking strategies, applied to the spatial feature extraction module and the object association module respectively, giving the model additional and more challenging training tasks.
- **Object Masked Attention (OMA)**: The influence of irrelevant queries in the cross-attention mechanism is reduced, which improves temporal modeling. OMA is applied in the third stage (the temporal feature aggregation module). By randomly discarding some background object queries, it makes the model focus more on the information and feature updates of foreground objects.

With these methods, integrated into the latest decoupled universal video segmentation framework, the authors achieve state-of-the-art performance without increasing the number of model parameters. Experimental results show significant improvements on three video segmentation tasks (video panoptic segmentation, video semantic segmentation and video instance segmentation) across five different datasets.
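The two ideas described above, random patch masking and random dropping of background queries, can be illustrated with a minimal, framework-free Python sketch. This is not the authors' implementation: the function names `mask_patches` and `drop_background_queries`, the parameters `patch_size`, `mask_ratio` and `drop_prob`, and the assumption that each query already carries a foreground/background flag are all hypothetical, chosen only for illustration.

```python
import random


def mask_patches(frame, patch_size, mask_ratio, rng):
    """Return a copy of a 2D frame with random square patches set to 0.

    Hypothetical sketch of MVC-style input masking: during training the
    network would see the masked frame, while the loss would still be
    computed against the segmentation of the full, unmasked frame.
    """
    height, width = len(frame), len(frame[0])
    masked = [row[:] for row in frame]  # leave the input frame untouched
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Each non-overlapping patch is masked independently.
            if rng.random() < mask_ratio:
                for y in range(top, min(top + patch_size, height)):
                    for x in range(left, min(left + patch_size, width)):
                        masked[y][x] = 0
    return masked


def drop_background_queries(is_foreground, drop_prob, rng):
    """Return a keep-mask over object queries.

    Hypothetical sketch of OMA-style query dropping: foreground queries
    are always kept, background queries are discarded with probability
    drop_prob. How a query is classified as foreground is assumed to be
    given; the source does not specify it.
    """
    return [fg or rng.random() >= drop_prob for fg in is_foreground]


if __name__ == "__main__":
    rng = random.Random(0)
    frame = [[1] * 8 for _ in range(8)]
    for row in mask_patches(frame, patch_size=4, mask_ratio=0.5, rng=rng):
        print(row)
    print(drop_background_queries([True, False, False, True], 0.5, rng))
```

In a real query-based model, the keep-mask would typically be turned into an attention mask so that dropped queries do not contribute to cross-attention; that step depends on the framework and is omitted here.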

Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended?
