Temporal adaptive feature pyramid network for action detection

Xuezhi Xiang, Hang Yin, Yulong Qiao, Abdulmotaleb El Saddik
DOI: https://doi.org/10.1016/j.cviu.2024.103945
IF: 4.886
2024-01-26
Computer Vision and Image Understanding
Abstract: Detecting actions in videos has become a prominent research task because of its wide range of applications. In addition to recognizing the action category, this task requires localizing the start time and end time of each action instance, which demands strong temporal modeling capability. Moreover, the duration varies considerably from one action instance to another. Although previous works have attempted to address this difficulty, it remains a persistent problem. To address it further, we propose an action detection network using a temporal feature pyramid, which can collect data using cameras and predict precise action categories and localizations. Specifically, we introduce a temporal adaptive module, which mixes self-attention and 1D convolution to flexibly adjust the temporal receptive field and improve temporal modeling for different actions. We also propose a channel adaptive module to adjust channel weights and suppress useless information. We then propose the Temporal Adaptive Feature Pyramid Network (TAFPN), which integrates the two modules to adaptively extract multi-scale temporal information. We also replace the traditional parallel head with a unified head built by stacking channel adaptive modules, which simplifies the network structure. Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our method is competitive with state-of-the-art methods, demonstrating its effectiveness.
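The two modules named in the abstract can be illustrated with a toy, pure-Python sketch. This is not the authors' implementation: the abstract does not specify how the branches are combined or how channel weights are computed, so the fixed mixing weight `alpha`, the identity attention projections, the depthwise kernel, and the sigmoid-of-mean gate are all assumptions made for illustration. Features are a list of `T` time steps, each a list of `C` channel values.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    # Global branch: each time step attends to every time step.
    # Assumption: identity query/key/value projections (no learned weights).
    dim = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(dim)])
    return out

def conv1d(x, kernel):
    # Local branch: depthwise 1D convolution over time, zero padded.
    # Assumption: odd kernel length, same kernel shared by all channels.
    steps, dim, half = len(x), len(x[0]), len(kernel) // 2
    out = []
    for t in range(steps):
        row = [0.0] * dim
        for k, w in enumerate(kernel):
            s = t + k - half
            if 0 <= s < steps:
                for c in range(dim):
                    row[c] += w * x[s][c]
        out.append(row)
    return out

def temporal_adaptive(x, kernel, alpha):
    # Hypothetical mix: alpha weights the global (attention) branch,
    # 1 - alpha weights the local (convolution) branch.
    attn, conv = self_attention(x), conv1d(x, kernel)
    return [[alpha * a + (1 - alpha) * c for a, c in zip(ra, rc)]
            for ra, rc in zip(attn, conv)]

def channel_adaptive(x):
    # Hypothetical gate: scale each channel by the sigmoid of its temporal mean.
    steps, dim = len(x), len(x[0])
    means = [sum(row[c] for row in x) / steps for c in range(dim)]
    gates = [1.0 / (1.0 + math.exp(-m)) for m in means]
    return [[g * v for g, v in zip(gates, row)] for row in x]

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 steps, C = 2 channels
mixed = temporal_adaptive(features, kernel=[0.25, 0.5, 0.25], alpha=0.5)
gated = channel_adaptive(mixed)
```

In this sketch, moving `alpha` toward 1 widens the effective temporal receptive field (attention sees all time steps), while moving it toward 0 narrows it to the kernel width. A learned, input-dependent weighting would be one plausible way to make that choice adaptive.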
computer science, artificial intelligence; engineering, electrical & electronic