A2VIS: Amodal-Aware Approach to Video Instance Segmentation

Minh Tran, Thang Pham, Winston Bounsavy, Tri Nguyen, Ngan Le
2024-12-02
Abstract: Handling occlusion remains a significant challenge for video instance-level tasks like Multiple Object Tracking (MOT) and Video Instance Segmentation (VIS). In this paper, we propose a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in a video. The key intuition is that awareness of amodal segmentation through spatiotemporal dimension enables a stable stream of object information. In scenarios where objects are partially or completely hidden from view, amodal segmentation offers more consistency and less dramatic changes along the temporal axis compared to visible segmentation. Hence, both amodal and visible information from all clips can be integrated into one global instance prototype. To effectively address the challenge of video amodal segmentation, we introduce the spatiotemporal-prior Amodal Mask Head, which leverages visible information intra clips while extracting amodal characteristics inter clips. Through extensive experiments and ablation studies, we show that A2VIS excels in both MOT and VIS tasks in identifying and tracking object instances with a keen understanding of their full shape.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to handle object occlusion effectively in Video Instance Segmentation (VIS) and Multiple-Object Tracking and Segmentation (MOTS). Existing VIS methods struggle with partially or completely occluded objects. In long sequences especially, an object may be heavily occluded and then reappear, which can cause identity switches. These failures are common because existing methods rely mainly on the visible parts of objects and do not model the full object when part of it is hidden.
To solve this problem, the paper proposes a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS). The framework introduces amodal awareness to achieve a reliable and comprehensive understanding of both the visible and the occluded parts of objects in a video. The key idea is that awareness of amodal segmentation along the spatiotemporal dimension provides a more stable stream of object information. When an object is partially or completely occluded, its amodal segmentation changes less along the temporal axis than its visible segmentation, so object identity is easier to maintain.
To address the challenges of video amodal segmentation, A2VIS introduces the Spatiotemporal-prior Amodal Mask Head (SAMH). This module predicts amodal masks from short-range and long-range spatiotemporal information. The short-range information comes from visible segmentation in adjacent frames, while the long-range information comes from amodal segmentation across the entire video. These priors are modeled through a mask attention mechanism with a Visible Spatio-Temporal Prior Mask (VSPM) and an Amodal Spatio-Temporal Prior Mask (ASPM).
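The summary does not spell out how the prior masks enter the attention computation. As a rough illustration only, the sketch below shows one common way a binary prior mask can restrict attention: positions outside the mask receive a score of negative infinity before the softmax. The function name `masked_attention`, the single-query setup, and the fallback to unmasked attention when the mask is empty are all assumptions of this sketch, not details taken from the paper.

```python
import math


def masked_attention(query, keys, values, prior_mask):
    """Single-query scaled dot-product attention restricted by a binary prior mask.

    Hypothetical illustration: `prior_mask[i]` is 1 if position i may be attended
    to (for example, it lies inside a VSPM- or ASPM-like prior), else 0.
    Returns (output_vector, attention_weights).
    """
    if not any(prior_mask):
        # Assumption of this sketch: an empty prior falls back to full attention.
        prior_mask = [1] * len(keys)

    scale = math.sqrt(len(query))
    neg_inf = float("-inf")
    # Score every position; masked-out positions get -inf so softmax gives them 0.
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale if keep else neg_inf
        for key, keep in zip(keys, prior_mask)
    ]

    # Numerically stable softmax over the scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


if __name__ == "__main__":
    q = [1.0, 0.0]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    v = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
    out, w = masked_attention(q, k, v, prior_mask=[1, 0, 1])
    print(out, w)
```

In this toy call the second position is excluded by the prior, so the query's attention is split only between the first and third positions.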
In addition, A2VIS introduces global instance prototypes that capture both the visible and the amodal features of objects in a video, enabling more robust object updating and association throughout the video, especially under occlusion. Through extensive experiments and ablation studies on multiple benchmarks, the paper shows that A2VIS outperforms existing state-of-the-art VIS and MOT methods at identifying and tracking object instances, particularly in recovering the complete shape of objects.
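The summary does not describe how the global instance prototypes are updated or matched. Purely as a sketch of the general idea, the code below keeps one embedding per tracked instance, matches each new clip's instance embeddings to the prototypes by cosine similarity, and blends matched embeddings into their prototype. The greedy matching, the exponential moving average, and the names `associate_and_update`, `momentum`, and `min_sim` are assumptions made for illustration and may differ from what A2VIS actually does.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def associate_and_update(prototypes, clip_embeddings, momentum=0.8, min_sim=0.5):
    """Match clip-level instance embeddings to global prototypes and update them.

    Hypothetical procedure: greedy one-to-one matching by descending cosine
    similarity, a moving-average update for matches, and a new prototype for
    every unmatched embedding.
    Returns (updated_prototypes, assignment), where assignment maps each
    embedding index to a prototype index.
    """
    # All (similarity, prototype index, embedding index) candidates, best first.
    pairs = sorted(
        (
            (cosine(p, e), pi, ei)
            for pi, p in enumerate(prototypes)
            for ei, e in enumerate(clip_embeddings)
        ),
        reverse=True,
    )

    assignment = {}
    used_prototypes = set()
    for sim, pi, ei in pairs:
        if sim < min_sim:
            break  # remaining candidates are even less similar
        if pi in used_prototypes or ei in assignment:
            continue
        assignment[ei] = pi
        used_prototypes.add(pi)

    updated = [list(p) for p in prototypes]
    for ei, emb in enumerate(clip_embeddings):
        if ei in assignment:
            pi = assignment[ei]
            # Blend the new observation into the existing prototype.
            updated[pi] = [
                momentum * p + (1.0 - momentum) * x for p, x in zip(updated[pi], emb)
            ]
        else:
            # Unmatched embedding starts a new instance identity.
            assignment[ei] = len(updated)
            updated.append(list(emb))
    return updated, assignment


if __name__ == "__main__":
    protos = [[1.0, 0.0], [0.0, 1.0]]
    clip = [[0.0, 2.0], [1.0, 0.1], [-1.0, -1.0]]
    print(associate_and_update(protos, clip))
```

Because a prototype changes slowly under this kind of update, a briefly occluded instance keeps an embedding close to its earlier appearance, which is the intuition behind more stable association that the summary describes.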