TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking

Raghav Goyal,Wan-Cyuan Fan,Mennatullah Siam,Leonid Sigal

2024-04-10

Abstract: Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings, which involve long videos with global motion (e.g., in egocentric settings), depicting small objects undergoing both rigid and non-rigid (including state) deformations. While a number of recent approaches have been explored for this task, these data characteristics still present challenges. In this work we propose a novel, clip-based DETR-style encoder-decoder architecture, which focuses on systematically analyzing and addressing the aforementioned challenges. Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations -- a form of "soft" hard example mining. Further, we propose a multiplicative time-coded memory, beyond vanilla additive positional encoding, which helps propagate context across long videos. Finally, we incorporate these in our proposed holistic multi-scale video transformer for tracking via multi-scale memory matching and decoding to ensure sensitivity and accuracy for long videos and small objects. Our model enables on-line inference with long videos in a windowed fashion, by breaking the video into clips and propagating context among them. We illustrate that short clip length and longer memory with learned time-coding are important design choices for improved performance. Collectively, these technical contributions enable our model to achieve new state-of-the-art (SoTA) performance on two complex egocentric datasets -- VISOR and VOST -- while achieving results comparable to SoTA on the conventional VOS benchmark, DAVIS'17. A series of detailed ablations validates our design choices and provides insights into the importance of parameter choices and their impact on performance.
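The windowed on-line inference mentioned in the abstract (break the video into clips, propagate context among them) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `split_into_clips`, `run_windowed`, and `segment_clip`, the non-overlapping clip layout, and the fixed-size memory of past clip results are all assumptions made for illustration.

```python
from collections import deque
from typing import Any, Callable, Deque, List, Tuple


def split_into_clips(num_frames: int, clip_len: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into consecutive, non-overlapping clips.

    Assumption: clips do not overlap; the last clip may be shorter.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [
        range(start, min(start + clip_len, num_frames))
        for start in range(0, num_frames, clip_len)
    ]


def run_windowed(
    num_frames: int,
    clip_len: int,
    memory_size: int,
    segment_clip: Callable[[List[int], List[Tuple[int, Any]]], Any],
) -> List[Any]:
    """Process a video clip by clip, passing a bounded memory of past results.

    `segment_clip` is a hypothetical stand-in for the model: it receives the
    frame indices of the current clip and the current memory, and returns
    whatever the model predicts for that clip.
    """
    # Bounded memory: once full, the oldest entry is dropped automatically.
    memory: Deque[Tuple[int, Any]] = deque(maxlen=memory_size)
    outputs: List[Any] = []
    for clip in split_into_clips(num_frames, clip_len):
        result = segment_clip(list(clip), list(memory))
        outputs.append(result)
        # Store the clip's first frame index next to its result so that a
        # time-aware model could tell how old each memory entry is.
        memory.append((clip[0], result))
    return outputs
```

For example, `run_windowed(8, 2, 2, lambda frames, memory: len(memory))` processes four clips of two frames each, and the stand-in model sees memories of size 0, 1, 2, and 2. The abstract's observation that short clips and longer memory help would correspond here to a small `clip_len` and a larger `memory_size`.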

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses several key challenges in Video Object Segmentation (VOS), particularly on long videos and small objects:

1. **Long Videos and Complex Deformations**: In long videos, objects may undergo both rigid and non-rigid deformations, including state changes (e.g., peeling or cutting a banana), which make it difficult for existing methods to track and segment the target objects.
2. **Small Object Segmentation**: Small objects tend to lose detail during matching, leading to poor segmentation results.
3. **Limitations of Existing Methods**: Current methods struggle with the above challenges, especially when objects undergo significant deformations, and fail to maintain accurate tracking and segmentation.

To address these issues, the authors propose a new framework, TAM-VT (Transformation-Aware Multi-scale Video Transformer). Its main contributions are:

- A transformation-aware loss that focuses learning on portions of the video where objects undergo significant deformations.
- A multiplicative time-coded memory module that better propagates contextual information across long videos.
- A DETR-style multi-scale video transformer that combines multi-scale memory matching and decoding to improve sensitivity and accuracy for long videos and small objects.
- A series of detailed ablation experiments that validate the design choices and provide insights into parameter selection and its impact on performance.

Experimental results show that TAM-VT achieves state-of-the-art performance on two complex egocentric video datasets (VISOR and VOST) and results comparable to the state of the art on the conventional VOS benchmark (DAVIS'17).
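The idea behind the transformation-aware loss, giving more weight to frames where the object has changed a lot, can be illustrated with a small sketch. The summary does not give the paper's formula, so everything here is an assumption: the change measure (one minus the IoU between each frame's mask and the first frame's mask), the weight `1 + alpha * change`, and the function names are hypothetical and chosen only to show "soft" hard example mining in general.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks are treated as identical.
    return inter / union if union else 1.0


def transformation_weights(masks: Sequence[Mask], alpha: float = 1.0) -> List[float]:
    """Per-frame weights that grow as the mask departs from the first frame.

    Assumption: the amount of transformation is approximated by 1 - IoU with
    the reference (first) mask; the paper may measure it differently.
    """
    reference = masks[0]
    return [1.0 + alpha * (1.0 - iou(reference, m)) for m in masks]


def weighted_loss(frame_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-frame losses, normalized by the sum of weights."""
    if len(frame_losses) != len(weights):
        raise ValueError("frame_losses and weights must have the same length")
    total = sum(weights)
    return sum(l * w for l, w in zip(frame_losses, weights)) / total
```

With three 2x2 masks where the last one no longer overlaps the first, such as `[[1, 1], [0, 0]]`, `[[1, 1], [0, 0]]`, `[[0, 0], [1, 1]]`, the weights are `[1.0, 1.0, 2.0]`, so the heavily transformed frame contributes twice as much to the loss. The weighting is "soft" because easy frames are down-weighted relative to hard ones rather than discarded.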

TransVOS: Video Object Segmentation with Transformers

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Fast Real-Time Video Object Segmentation with a Tangled Memory Network

Efficient Video Object Segmentation via Modulated Cross-Attention Memory

TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers

DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation

Dual Temporal Memory Network for Efficient Video Object Segmentation

VideoTrack: Learning to Track Objects Via Video Transformer

Structural Transformer with Region Strip Attention for Video Object Segmentation

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Decoupled Cross-Modal Transformer for Referring Video Object Segmentation

Towards Robust Video Instance Segmentation with Temporal-Aware Transformer

Scalable Video Object Segmentation with Identification Mechanism

Associating Objects with Scalable Transformers for Video Object Segmentation

Siamese Network with Interactive Transformer for Video Object Segmentation

Video Object Segmentation Based on Multi-Level Target Models and Feature Integration

Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking

Video Instance Segmentation Using Graph Matching Transformer

End-to-End Video Instance Segmentation with Transformers

TTVOS: Lightweight Video Object Segmentation with Adaptive Template Attention Module and Temporal Consistency Loss