Abstract: The authors propose a deep‐learning framework, CTGMOT, for multi‐object tracking (MOT) in complex team sports videos. The backbone network combines a CNN and Transformers to extract local and global features, and parallel decoders fuse appearance and motion features. To capture spatio‐temporal correlations accurately, the framework uses a GNN and an attention mechanism to fuse the spatial tracking features of objects within a frame with the temporal tracking features across frames, which better distinguishes fast‐moving and occluded targets and improves the performance of online MOT. MOT in sports scenes faces severe occlusions, similar appearances, drastic pose changes, and complex motion patterns. To address these challenges, a deep‐learning framework, CTGMOT (CNN‐Transformer‐GNN‐based MOT), is proposed specifically for tracking multiple athletes in sports videos; it jointly models detection, appearance, and motion features. First, a detection network that combines Convolutional Neural Networks (CNN) and Transformers is constructed to extract both local and global features from images, and appearance and motion features are fused through parallel dual‐branch decoders. Second, graph models are built using Graph Neural Networks (GNN) to accurately capture the spatio‐temporal correlations between object and trajectory features from inter‐frame and intra‐frame associations. Experimental results on the public sports tracking dataset SportsMOT show that the proposed framework outperforms other state‐of‐the‐art methods for MOT in complex sports scenes. In addition, the proposed framework shows excellent generality on the benchmark datasets MOT17 and MOT20.
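
To make the idea of fusing appearance and motion cues for inter‐frame association concrete, the following is a minimal sketch only, not the CTGMOT method. It assumes a hand‐set weighted sum of cosine similarity (appearance) and box IoU (motion) followed by greedy matching, where the framework described above instead learns the fusion with dual‐branch decoders, a GNN, and attention. The names `associate`, `w_app`, and `min_score`, and the dictionary layout of tracks and detections, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors (appearance cue)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes (motion cue)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, w_app=0.5, min_score=0.3):
    """Greedily match tracks to detections by a fused affinity score.

    Assumption: a fixed weighted sum stands in for the learned fusion.
    Returns a sorted list of (track_index, detection_index) pairs.
    """
    scored = []
    for ti, t in enumerate(tracks):
        for di, d in enumerate(detections):
            score = (w_app * cosine(t["emb"], d["emb"])
                     + (1.0 - w_app) * iou(t["box"], d["box"]))
            if score >= min_score:
                scored.append((score, ti, di))
    scored.sort(reverse=True)  # highest fused affinity first

    used_tracks, used_dets, matches = set(), set(), []
    for _, ti, di in scored:
        if ti not in used_tracks and di not in used_dets:
            matches.append((ti, di))
            used_tracks.add(ti)
            used_dets.add(di)
    return sorted(matches)


# Two heavily overlapping players with distinct appearance embeddings.
tracks = [
    {"box": (0, 0, 10, 20), "emb": (1.0, 0.0)},
    {"box": (2, 0, 12, 20), "emb": (0.0, 1.0)},
]
detections = [
    {"box": (3, 0, 13, 20), "emb": (0.0, 1.0)},
    {"box": (1, 0, 11, 20), "emb": (1.0, 0.0)},
]
matches = associate(tracks, detections)
```

In this toy case the boxes overlap heavily, so the appearance term favours pairing each track with the detection whose embedding it shares. A learned model would replace both the fixed weight and the greedy step.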

Online Multiplayer Tracking by Extracting Temporal Contexts with Transformer

Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer

Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

TransVOS: Video Object Segmentation with Transformers

TransLink: Transformer-Based Embedding for Tracklets’ Global Link

Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking

Exploiting spatial and temporal context for online tracking with improved transformer

Exploiting Temporal Coherence for Self-Supervised Visual Tracking by Using Vision Transformer

Exploring Multi-Modal Spatial-Temporal Contexts for High-Performance RGB-T Tracking

PuTR: A Pure Transformer for Decoupled and Online Multi-Object Tracking

CXTrack: Improving 3D Point Cloud Tracking with Contextual Information

A deep learning framework for multi‐object tracking in team sports videos

Robust multi-object tracking via cross-domain contextual information for sports video analysis

Modeling of Multiple Spatial-Temporal Relations for Robust Visual Object Tracking

MOTR: End-to-End Multiple-Object Tracking with Transformer

ODTrack: Online Dense Temporal Token Learning for Visual Tracking

Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows

Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection

Effective and Robust: A Discriminative Temporal Learning Transformer for Satellite Videos

TrackFormer: Multi-Object Tracking with Transformers