Temporal Deformable Transformer for Action Localization

Haoying Wang,Ping Wei,Meiqin Liu,Nanning Zheng
DOI: https://doi.org/10.1007/978-3-031-44223-0_45
2023-01-01
Abstract:Temporal action localization (TAL) is a challenging task that has received significant attention in video understanding. Recently, Transformer-based models have demonstrated their effectiveness in capturing contextual information and achieved outstanding performance on various TAL benchmarks. However, these methods still face challenges in computational efficiency and contextual modeling rigidity. In this paper, we propose a method to address those problems in Transformer-based models. Our model introduces a temporal deformable Transformer module and the corresponding time normalization, enabling flexible aggregation of temporal context information in videos, leading to enhanced video representations. To demonstrate the effectiveness of the proposed method, we construct a Transformer-based anchor-free model with a simple prediction head, which yields superior performance on widely used benchmarks. Specifically, it achieves an average mAP of 67.4% on THUMOS14 and an average mAP of 36.8% on ActivityNet-v1.3.
What problem does this paper attempt to address?