Abstract: Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still have two problems. First, they apply a complicated and redundant two-extractor pipeline to use more reference frames as cues, which increases the models’ parameters and complexity. Second, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address these two problems, we propose a unified transformer-based framework for VOS: a compact single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter that encodes the two key inputs, i.e., the reference sets (historical frames with predicted masks) and the query frame (current frame). Moreover, because the relationships among different frames and inside every frame are crucial for this task, we introduce a vision transformer to exploit and model the temporal and spatial relationships simultaneously. Through the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼50 [...] (J&F) drop in segmentation quality. Code is available at https://github.com/sallymmx/TransVOS.git.
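To make the idea of modeling temporal and spatial relationships simultaneously more concrete, here is a minimal sketch in plain Python. It is not the paper's relationship module. It assumes that per-pixel features from the reference frames and the query frame are flattened into a single token sequence, and it applies single-head self-attention with identity projections (no learned weights, no positional encoding). All function names (`make_frame`, `flatten_frames`, `self_attention`) are hypothetical and exist only for this illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_frame(offset, h=2, w=2, d=3):
    """Build a toy H x W feature map with d-dimensional features per pixel."""
    return [[[0.1 * (offset + y * w + x + c) for c in range(d)]
             for x in range(w)]
            for y in range(h)]


def flatten_frames(frames):
    """Flatten T frames of H x W features into one sequence of T*H*W tokens."""
    return [pixel for frame in frames for row in frame for pixel in row]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), which keeps the sketch free of learned weights.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


# Two reference frames plus one query frame, each 2 x 2 with 3 features.
reference_frames = [make_frame(0), make_frame(4)]
query_frame = make_frame(8)

# Every token attends to every other token, so one attention pass mixes
# information both within a frame (spatial) and across frames (temporal).
tokens = flatten_frames(reference_frames + [query_frame])
attended = self_attention(tokens)
print(len(tokens), len(attended), len(attended[0]))
```

Because all frames share one token sequence, each attention weight can link two pixels of the same frame or pixels from different frames. This is one simple way to read "spatial and temporal relationships simultaneously". A real model would add learned projections, multiple heads, and positional information.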

TransVOS: Video Object Segmentation with Transformers

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation
