Abstract: This paper proposes a lightweight object tracking algorithm based on the transformer architecture. A joint attention module is introduced to exploit spatio‐temporal context information and strengthen feature extraction. To address occlusion, two strategies are adopted: a position encoding generator module is added to the transformer structure to provide position discrimination, and a dynamic template update strategy is added to increase template reliability. Together, these two strategies improve the robustness of the algorithm while reducing computational requirements.

Current transformer‐based multi‐object tracking methods generally rely on the self‐attention mechanism and global modelling ability of the transformer to improve tracking accuracy. However, most existing methods depend heavily on hardware resources, which leads to an imbalance between accuracy and speed in practical applications. A lightweight transformer joint position awareness algorithm is therefore proposed to address these problems. First, a joint attention module is proposed to enhance the ShuffleNet V2 network. This module comprises a spatio‐temporal pyramid module and a convolutional block attention module. The spatio‐temporal pyramid module fuses multi‐scale features to capture information at different spatial and temporal scales. The convolutional block attention module aggregates information along the channel and spatial dimensions to enhance the representation ability of the model. Then, a position encoding generator module and a dynamic template update strategy are proposed to handle occlusion. The position encoding generator module applies group convolution to the input sequence, with each convolution group responsible for handling the relative positional relationships of a specific range. To improve the reliability of the template, the dynamic template update strategy updates the template at appropriate times. The effectiveness of the approach is validated on the MOT16, MOT17, and MOT20 datasets.
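
The group convolution used for position encoding can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the weight layout (output channel, input channel within the group, kernel tap), the zero padding, and the residual addition of the encoding to the tokens are all assumptions made for illustration, and the kernel size is assumed to be odd. The sketch shows only the general idea that each channel group is convolved independently over neighbouring tokens.

```python
from typing import List

Matrix = List[List[float]]        # [num_tokens][channels]
Kernel = List[List[List[float]]]  # [channels][channels // groups][kernel_size]


def grouped_conv1d(tokens: Matrix, weight: Kernel, groups: int) -> Matrix:
    """Grouped 1D convolution along the token axis with zero padding.

    Hypothetical helper: each output channel only reads the input
    channels that belong to its own group.
    """
    num_tokens = len(tokens)
    channels = len(tokens[0])
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    per_group = channels // groups
    kernel_size = len(weight[0][0])  # assumed odd
    pad = kernel_size // 2
    out = [[0.0] * channels for _ in range(num_tokens)]
    for t in range(num_tokens):
        for o in range(channels):
            first_in = (o // per_group) * per_group  # first channel of this group
            acc = 0.0
            for i in range(per_group):
                for k in range(kernel_size):
                    src = t + k - pad
                    if 0 <= src < num_tokens:  # positions outside the sequence count as zero
                        acc += weight[o][i][k] * tokens[src][first_in + i]
            out[t][o] = acc
    return out


def add_position_encoding(tokens: Matrix, weight: Kernel, groups: int) -> Matrix:
    """Add the convolution output to the tokens (assumed residual form)."""
    encoding = grouped_conv1d(tokens, weight, groups)
    return [[x + p for x, p in zip(row, enc)] for row, enc in zip(tokens, encoding)]


# Three identical tokens with four channels, two groups, all-ones kernel of size 3.
tokens = [[1.0, 1.0, 1.0, 1.0] for _ in range(3)]
weight = [[[1.0, 1.0, 1.0] for _ in range(2)] for _ in range(4)]
print(add_position_encoding(tokens, weight, groups=2))
# → [[5.0, 5.0, 5.0, 5.0], [7.0, 7.0, 7.0, 7.0], [5.0, 5.0, 5.0, 5.0]]
```

In this toy input the three tokens are identical, yet the first and last receive a different encoding from the middle one because the zero padding shortens their neighbourhood. In a trained model the kernel weights would be learned rather than fixed to one.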

A transformer‐based lightweight method for multiple‐object tracking