DTA: Deformable Temporal Attention for Video Recognition

Xiaohan Lei, Mengyu Yang, Gongli Xi, Yang Liu, Jiulin Li, Lanshan Zhang, Ye Tian
DOI: https://doi.org/10.1109/ijcnn60899.2024.10650436
2024-01-01
Abstract: Recently, Transformer models have demonstrated superior performance in video tasks. However, most current video Transformers overlook inherent temporal regions of interest, such as motion trajectories, which leaves them susceptible to redundant information during temporal modeling. Existing methods that do attend to motion trajectories are computationally expensive and therefore not lightweight. To balance effective modeling of temporal regions of interest against computational efficiency, we propose a video Transformer backbone with deformable temporal attention (DTA). Inspired by work on deformable receptive fields, DTA employs a lightweight decision network to make temporal attention more flexible. The decision network computes offsets for tokens in the input feature map, allowing them to move to temporally relevant regions of interest and model temporal information efficiently. We conducted extensive experiments on three popular datasets, where DTA surpassed the baseline. We also performed ablation experiments on the model structure and parameters. These results confirm the effectiveness of the proposed deformable temporal attention mechanism.
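The offset-then-attend idea described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not specify the decision network, how offsets are bounded, or how features are resampled. The one-layer `offset_weights` network, the `tanh` bound `max_offset`, the linear interpolation along time, and all function names are assumptions made for illustration only.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear_interp(seq, pos):
    """Sample a list of vectors at a fractional temporal position (clamped)."""
    pos = min(max(pos, 0.0), len(seq) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(seq) - 1)
    w = pos - lo
    return [(1.0 - w) * a + w * b for a, b in zip(seq[lo], seq[hi])]

def deformable_temporal_attention(tokens, offset_weights, max_offset=2.0):
    """tokens: T vectors of size D (one spatial location across T frames).
    offset_weights: D weights of a hypothetical one-layer decision network."""
    T, D = len(tokens), len(tokens[0])
    # Assumed decision network: one bounded scalar temporal offset per token.
    offsets = [max_offset * math.tanh(dot(tok, offset_weights)) for tok in tokens]
    # Shift each reference point t to t + offset and resample features there.
    sampled = [linear_interp(tokens, t + offsets[t]) for t in range(T)]
    # Scaled dot-product attention: queries from the original tokens,
    # keys and values from the resampled (deformed) tokens.
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / math.sqrt(D) for k in sampled])
        out.append([sum(w * v[d] for w, v in zip(weights, sampled)) for d in range(D)])
    return out, offsets

# Usage on random toy data: 8 frames, 4 channels.
random.seed(0)
T, D = 8, 4
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
offset_weights = [random.gauss(0, 0.5) for _ in range(D)]
out, offsets = deformable_temporal_attention(tokens, offset_weights)
```

With all-zero `offset_weights` every offset is zero, so the sketch reduces to ordinary temporal self-attention over the T frames. The learned offsets are what would let keys and values be drawn from other temporal positions.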