Abstract: Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence. Prior work in this field has demonstrated promising results by extending image semantic segmentation models to exploit temporal relationships across video frames; however, these approaches often incur significant computational costs. In this paper, we propose an efficient mask propagation framework for VSS, called MPVSS. Our approach first employs a strong query-based image segmentor on sparse key frames to generate accurate binary masks and class predictions. We then design a flow estimation module that uses the learned queries to generate a set of segment-aware flow maps, each associated with a mask prediction from the key frame. Finally, each mask is warped with its paired flow map, and the warped masks serve as the mask predictions for the non-key frames. By reusing predictions from key frames, we avoid processing every video frame individually with a resource-intensive segmentor, which alleviates temporal redundancy and significantly reduces computational costs. Extensive experiments on VSPW and Cityscapes demonstrate that our mask propagation framework achieves state-of-the-art (SOTA) accuracy and efficiency trade-offs. For instance, on the VSPW dataset, our best model with a Swin-L backbone outperforms the SOTA MRCFA with a MiT-B5 backbone by 4.0% mIoU while requiring only 26% of the FLOPs. Moreover, our framework reduces FLOPs by up to 4x compared to the per-frame Mask2Former baseline, with at most 2% mIoU degradation on the Cityscapes validation set. Code is available at https://github.com/ziplab/MPVSS.
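The propagation step described in the abstract can be illustrated with a minimal sketch. This is not the MPVSS implementation: it assumes backward warping with bilinear sampling, a flow convention in which each non-key-frame pixel (x, y) samples the key frame at (x + dx, y + dy), and a simplified per-pixel argmax over the warped masks to pick the class. The names `warp_mask` and `propagate` are hypothetical, and plain Python lists stand in for tensors.

```python
from math import floor


def warp_mask(mask, flow):
    """Backward-warp one key-frame mask with its paired flow map.

    mask: H x W list of numbers (mask scores from the key frame).
    flow: H x W list of (dx, dy) pairs. Assumed convention: the non-key-frame
          pixel (x, y) samples the key frame at (x + dx, y + dy).
    Samples that fall outside the key frame contribute 0.
    """
    h, w = len(mask), len(mask[0])

    def at(x, y):
        return mask[y][x] if 0 <= x < w and 0 <= y < h else 0.0

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x + dx, y + dy
            x0, y0 = floor(sx), floor(sy)
            ax, ay = sx - x0, sy - y0  # bilinear interpolation weights
            out[y][x] = ((1 - ax) * (1 - ay) * at(x0, y0)
                         + ax * (1 - ay) * at(x0 + 1, y0)
                         + (1 - ax) * ay * at(x0, y0 + 1)
                         + ax * ay * at(x0 + 1, y0 + 1))
    return out


def propagate(masks, flows, class_ids):
    """Warp every mask with its own flow, then label each pixel.

    Simplifying assumption: each pixel takes the class of the warped mask
    with the highest score; ties go to the earlier mask.
    """
    warped = [warp_mask(m, f) for m, f in zip(masks, flows)]
    h, w = len(warped[0]), len(warped[0][0])
    return [[class_ids[max(range(len(warped)), key=lambda n: warped[n][y][x])]
             for x in range(w)]
            for y in range(h)]


# Toy key frame: two segments, both moving one pixel to the right.
mask_a = [[1, 1, 0, 0], [1, 1, 0, 0]]
mask_b = [[0, 0, 1, 1], [0, 0, 1, 1]]
shift_right = [[(-1, 0)] * 4 for _ in range(2)]

labels = propagate([mask_a, mask_b], [shift_right, shift_right], class_ids=[7, 3])
print(labels)
```

In this sketch the non-key frame costs only one warp per mask, whereas the segmentor runs on key frames alone. That is the source of the savings the abstract describes, although the real framework also has to estimate the segment-aware flow maps.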

Semantic Video CNNs through Representation Warping

Tamed Warping Network for High-Resolution Semantic Video Segmentation

Dynamic Warping Network for Semantic Video Segmentation

How to Train Your Dragon: Tamed Warping Network for Semantic Video Segmentation

Real-time Semantic Segmentation with Weighted Factorized-Depthwise Convolution

Semantic Video Segmentation by Gated Recurrent Flow Propagation

Clockwork Convnets for Video Semantic Segmentation

UVid-Net: Enhanced Semantic Segmentation of UAV Aerial Videos by Embedding Temporal Information

Video Semantic Segmentation With Distortion-Aware Feature Correction

Efficient Semantic Segmentation for Compressed Video

A Semantics-Guided Warping for Semi-supervised Video Object Instance Segmentation

Rethinking Dilated Convolution for Real-time Semantic Segmentation

Mask Propagation for Efficient Video Semantic Segmentation

TapLab: A Fast Framework for Semantic Video Segmentation Tapping into Compressed-Domain Knowledge

Capturing the Spatio-Temporal Continuity for Video Semantic Segmentation

Efficient Semantic Video Segmentation with Per-Frame Inference

Video Semantic Segmentation Via Sparse Temporal Transformer

STFCN: Spatio-Temporal FCN for Semantic Video Segmentation

Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

Global Average Feature Augmentation for Robust Semantic Segmentation with Transformers