Abstract: Handling occlusion remains a significant challenge for video instance-level tasks like Multiple Object Tracking (MOT) and Video Instance Segmentation (VIS). In this paper, we propose a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in a video. The key intuition is that awareness of amodal segmentation through spatiotemporal dimension enables a stable stream of object information. In scenarios where objects are partially or completely hidden from view, amodal segmentation offers more consistency and less dramatic changes along the temporal axis compared to visible segmentation. Hence, both amodal and visible information from all clips can be integrated into one global instance prototype. To effectively address the challenge of video amodal segmentation, we introduce the spatiotemporal-prior Amodal Mask Head, which leverages visible information intra clips while extracting amodal characteristics inter clips. Through extensive experiments and ablation studies, we show that A2VIS excels in both MOT and VIS tasks in identifying and tracking object instances with a keen understanding of their full shape.

What problem does this paper attempt to address?

The paper addresses how to handle object occlusion effectively in Video Instance Segmentation (VIS) and Multiple Object Tracking and Segmentation (MOTS). Existing VIS methods struggle with partially or completely occluded objects. In long sequences in particular, an object may be severely occluded and later reappear, which can cause identity switches. These failures are common because existing methods rely mainly on visible regions and lack a complete understanding of an object while it is partially occluded.

To solve this problem, the paper proposes a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which introduces amodal awareness to achieve a reliable and comprehensive understanding of both the visible and occluded parts of objects in videos. The key intuition is that awareness of amodal segmentation across the spatiotemporal dimension provides a more stable stream of object information. When objects are partially or completely occluded, amodal segmentation changes less along the temporal axis than visible segmentation, so it preserves object identity better. To address video amodal segmentation, A2VIS introduces the Spatiotemporal-prior Amodal Mask Head (SAMH), which uses short-range and long-range spatiotemporal information to predict amodal masks. Short-range information comes from visible segmentation in adjacent frames, while long-range information comes from amodal segmentation across the entire video. These priors are modeled through a mask attention mechanism using the Visible Spatio-Temporal Prior Mask (VSPM) and the Amodal Spatio-Temporal Prior Mask (ASPM).
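The mask attention mentioned above can be pictured as ordinary attention in which a binary prior mask decides which positions a query is allowed to attend to. The following is a minimal sketch of that general idea only, not the paper's implementation: the function name masked_attention, the toy features, and the example masks vspm and aspm are all hypothetical, and the summary does not say how SAMH builds or combines the two priors, so each is simply applied in a separate pass here.

```python
import math


def masked_attention(query, keys, values, prior_mask):
    """Single-query attention restricted to positions allowed by a binary prior mask.

    query:      list of d floats
    keys:       list of n key vectors, each of length d
    values:     list of n value vectors (all the same length)
    prior_mask: list of n entries, 1 = may attend, 0 = blocked
    """
    scale = 1.0 / math.sqrt(len(query))
    # Blocked positions get a score of -inf so their softmax weight is exactly 0.
    scores = [
        scale * sum(q * k for q, k in zip(query, key)) if keep else float("-inf")
        for key, keep in zip(keys, prior_mask)
    ]
    out_dim = len(values[0])
    if not any(prior_mask):
        # Empty prior: nothing to attend to, so return zeros (a choice made for this sketch).
        return [0.0] * out_dim
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(out_dim)
    ]


# Toy data: three spatio-temporal positions with 2-D features (illustrative values only).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

vspm = [1, 0, 0]  # hypothetical visible prior: only position 0 is allowed
aspm = [0, 1, 1]  # hypothetical amodal prior: positions 1 and 2 are allowed

short_range = masked_attention(query, keys, values, vspm)
long_range = masked_attention(query, keys, values, aspm)
print(short_range, long_range)
```

With the visible prior, the query can only read the single allowed position, so the output equals that position's value vector. With the amodal prior, it blends the two allowed positions and ignores the blocked one entirely.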
In addition, A2VIS introduces global instance prototypes that capture both the visible and amodal features of objects in a video, enabling more robust object updating and association throughout the video, especially under occlusion. Through extensive experiments and ablation studies on multiple benchmarks, the paper reports that A2VIS identifies and tracks object instances more effectively, particularly in understanding their complete shape, and outperforms existing state-of-the-art VIS and MOT methods.

A2VIS: Amodal-Aware Approach to Video Instance Segmentation

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Self-supervised Amodal Video Object Segmentation

Foundation Models for Amodal Video Instance Segmentation in Automated Driving

Amodal Segmentation Based on Visible Region Segmentation and Shape Prior

Using Diffusion Priors for Video Amodal Segmentation

Coarse-to-Fine Amodal Segmentation with Shape Prior

Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation

ShapeFormer: Shape Prior Visible-to-Amodal Transformer-based Amodal Instance Segmentation

Self-supervised Video Object Segmentation Using Integration-Augmented Attention

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking

Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion

MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training

Audio-Visual Instance Segmentation

CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing

Amodal segmentation just like doing a jigsaw

Context-Aware Video Instance Segmentation

Amodal Instance Segmentation Via Prior-Guided Expansion

Learning Spatial-Semantic Features for Robust Video Object Segmentation