Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended?

Chen Liang, Qiang Guo, Xiaochao Qu, Luoqi Liu, Ting Liu
2024-08-20
Abstract: Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames. Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets. This leads to inconsistent segmentation results across frames. To address these issues, we propose a training strategy, Masked Video Consistency (MVC), which enhances spatial and temporal feature aggregation. MVC randomly masks image patches, compelling the network to predict the entire semantic segmentation and thus improving contextual information integration. Additionally, we introduce Object Masked Attention (OMA) to optimize the cross-attention mechanism by reducing the impact of irrelevant queries, thereby enhancing temporal modeling capabilities. Our approach, integrated into the latest decoupled universal video segmentation framework, achieves state-of-the-art performance across five datasets for three video segmentation tasks, demonstrating significant improvements over previous methods without increasing model parameters.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve several key problems in video segmentation:

1. **Difficulties in understanding object shapes, textures and contours on small-scale or class-imbalanced datasets**: Existing video segmentation models are usually built on image segmentation techniques. On small-scale or class-imbalanced datasets they struggle to understand and segment objects, often dividing a single object into multiple regions.
2. **Lack of consistency in segmentation results in the temporal dimension**: Current methods produce inconsistent results over time, so the segmentation of the same object is unstable between adjacent frames.
3. **Slow convergence due to a large number of queries in query-based models**: Using the cross-attention mechanism to locate the target segmentation area in the spatial and temporal dimensions requires a large number of training rounds, which makes the model inefficient in practical applications.

To address these challenges, the authors propose the following solutions:

- **Masked Video Consistency (MVC)**: By randomly occluding image patches and training the network to predict the semantic segmentation of the entire image, MVC strengthens the network's ability to aggregate context information at spatial and temporal scales. MVC introduces two different occlusion strategies, applied to the spatial feature extraction module and the object association module respectively, which give the model additional, more challenging training tasks.
- **Object Masked Attention (OMA)**: By reducing the influence of irrelevant queries in the cross-attention mechanism, OMA improves temporal modeling. It operates in the third stage (the temporal feature aggregation module): by randomly discarding some background object queries, it makes the model focus more on the information and feature updates of foreground objects.
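The random patch occlusion behind MVC can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not detail the two occlusion strategies, so the function name `mask_random_patches`, the patch size, the mask ratio, and the zero fill value are all assumptions chosen for illustration. A frame is represented as a plain list of pixel rows to keep the example dependency-free.

```python
import random


def mask_random_patches(image, patch_size=2, mask_ratio=0.5, fill=0.0, seed=None):
    """Return a copy of `image` with randomly chosen square patches set to `fill`.

    `image` is an H x W list of lists. The second return value is the
    patch-level grid of decisions (True means the patch was masked).
    All parameters are hypothetical; the paper's settings are not given here.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Each patch is masked independently with probability `mask_ratio`.
    patch_mask = [[rng.random() < mask_ratio for _ in range(cols)] for _ in range(rows)]
    masked = [
        [
            fill if patch_mask[y // patch_size][x // patch_size] else image[y][x]
            for x in range(width)
        ]
        for y in range(height)
    ]
    return masked, patch_mask


# Usage: occlude roughly half of the 2x2 patches in a 4x4 frame.
frame = [[1.0] * 4 for _ in range(4)]
masked_frame, patch_mask = mask_random_patches(frame, seed=0)
```

In training, the network would receive `masked_frame` but still be supervised on the segmentation of the whole frame, which is what forces it to draw on surrounding spatial and temporal context. The loss itself is omitted here because the summary does not specify it.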
Through these methods, the authors achieve state-of-the-art performance within the latest decoupled general-purpose video segmentation framework without increasing the number of model parameters. Experimental results show significant improvements on three video segmentation tasks (video panoptic segmentation, video semantic segmentation and video instance segmentation) across five different datasets.
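The query-dropping idea behind OMA, as summarized earlier, can also be sketched in a few lines. This is a hedged illustration under stated assumptions, not the paper's method: the helper names `sample_query_keep_mask` and `masked_softmax`, the `drop_prob` parameter, and the choice to apply the mask inside a softmax over query scores are all hypothetical. The sketch only shows the general pattern of always keeping foreground queries while randomly discarding background ones.

```python
import math
import random


def sample_query_keep_mask(is_foreground, drop_prob=0.5, seed=None):
    """Keep every foreground query; drop each background query with `drop_prob`.

    `drop_prob` is a hypothetical hyperparameter, not a value from the paper.
    """
    rng = random.Random(seed)
    return [fg or rng.random() >= drop_prob for fg in is_foreground]


def masked_softmax(scores, keep):
    """Softmax over `scores` that assigns zero weight where `keep` is False."""
    kept = [s for s, k in zip(scores, keep) if k]
    if not kept:
        raise ValueError("at least one query must be kept")
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: queries 0 and 2 are foreground; background queries may be dropped.
is_foreground = [True, False, True, False]
keep = sample_query_keep_mask(is_foreground, drop_prob=0.5, seed=0)
weights = masked_softmax([1.0, 5.0, 1.0, 5.0], keep)
```

Dropped background queries receive exactly zero attention weight, so the remaining weight is redistributed over the kept queries. Because the mask is sampled only during training, a scheme like this adds no parameters, which is consistent with the claim above.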