ConvTransformer Attention Network for temporal action detection

Di Cui, Chang Xin, Lifang Wu, Xiangdong Wang
DOI: https://doi.org/10.1016/j.knosys.2024.112264
IF: 8.139
2024-07-26
Knowledge-Based Systems
Abstract: Boundary detection is a challenging problem in Temporal Action Detection (TAD). Transformer-based methods achieve satisfactory results by using self-attention to model global dependencies for boundary detection, but they face two key issues. First, they do not explicitly learn local relationships, so boundary detection becomes imprecise when only subtle appearance changes occur between adjacent clips. Second, because self-attention tends to distribute focus across the entire input video, the features of different actions converge, which leads to imprecise predictions for overlapping actions. To address these challenges, we introduce the ConvTransformer Attention Network (CTAN), a novel framework composed of two primary components: (1) the Temporal Attention Block (TAB), a temporal attention mechanism designed to emphasize critical temporal positions enriched with essential action-related features; and (2) the ConvTransformer Block (CTB), which employs a hybrid structure to capture nuanced appearance changes locally and action transitions globally. With these components, CTAN focuses on motion features between overlapping actions and precisely captures both local differences between adjacent clips and global action transitions. Extensive experiments on multiple datasets, including THUMOS14, MultiTHUMOS, and ActivityNet, confirm the effectiveness of CTAN.
computer science, artificial intelligence