LGAFormer: transformer with local and global attention for action detection
Haiping Zhang,Fuxing Zhou,Dongjing Wang,Xinhao Zhang,Dongjin Yu,Liming Guan
DOI: https://doi.org/10.1007/s11227-024-06138-1
IF: 3.3
2024-05-08
The Journal of Supercomputing
Abstract:Temporal action detection is a very important task in video understanding, aiming at predicting the start and end time boundaries of all action instances in an unedited video and their action classification. This task has been widely studied, especially after the transformer has been widely used in the field of vision. However, transformer model brings massive computational resource consumption when processing long sequence input data. At the same time, due to the ambiguity of the video action boundary, many proposal and instance-based methods cannot predict the video boundary accurately. Based on the above two points, we propose LGAFormer: a very concise model that combines the local self-attention mechanism with the global self-attention mechanism, using local self-attention in the shallow layer of the network to process short-range temporal data to model local representations while reducing a large amount of computational consumption, and using global self-attention in the deep layer of the network to model long-range temporal context. This allows our model to achieve a good balance between effectiveness and efficiency. And in terms of detection head, we combine the advantages of the segment feature and instance feature to predict action boundaries more accurately. Thanks to these two points, our method achieves comparable results on all three datasets (THUMOS14, ActivityNet 1.3, and EPIC-Kitchens 100). On THUMOS14, an average mAP of 67.7% was obtained. On ActivityNet 1.3, the best performance is obtained with an average mAP of 36.6%. On EPIC-Kitchens 100, an average mAP of 24.6% was achieved.
computer science, theory & methods,engineering, electrical & electronic, hardware & architecture