Multi-granularity transformer fusion for temporal action localization

Min Zhang, Haiyang Hu, Zhongjin Li
DOI: https://doi.org/10.1007/s00500-024-09955-x
IF: 3.732
2024-07-19
Soft Computing
Abstract: Temporal action localization plays a significant role in video understanding; it aims to recognize both the category and the temporal interval of each action in untrimmed videos. Most previous transformer-based methods employ a feature space of a single temporal granularity. However, low-level temporal features cannot provide enough semantic information for action recognition, while high-level temporal features lack the rich details needed for boundary localization. To address this issue, we propose a multi-granularity transformer fusion framework (MGTF) to localize temporal actions in videos. Specifically, MGTF builds a transformer-based multi-granularity feature fusion pipeline and uses a direct set prediction strategy to generate action instances. Through top-down cross-granularity attention interaction, the boundary details of low-level features and the semantic information of high-level features are combined to improve feature discrimination. To reduce computation cost, we design a temporal shift attention that adaptively focuses on a sparse set of key segments. In addition, an actionness regression head is used to refine the confidence scores of the candidate instances. As a self-contained system, MGTF achieves state-of-the-art performance on THUMOS'14 and comparable performance on ActivityNet-1.3. Ablation studies and qualitative visualizations also demonstrate the effectiveness of the proposed approach.
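The top-down cross-granularity interaction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, feature shapes, and residual fusion are assumptions made for illustration, learned query/key/value projections are omitted, and plain dense attention stands in for the paper's sparse temporal shift attention. Fine-grained (low-level) features act as queries and attend over coarse-grained (high-level) features, so semantic context flows down to the detail-rich level.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_granularity_attention(fine, coarse):
    """Hypothetical top-down fusion step.

    fine:   (T_f, d) low-level features with many temporal steps (queries)
    coarse: (T_c, d) high-level features with few temporal steps (keys/values)
    Returns the fused fine-level features and the attention weights.
    """
    d = fine.shape[-1]
    scores = fine @ coarse.T / np.sqrt(d)  # (T_f, T_c) scaled dot products
    weights = softmax(scores, axis=-1)     # each fine step attends over coarse steps
    context = weights @ coarse             # (T_f, d) semantic context per fine step
    return fine + context, weights         # residual fusion (an assumption)

# Toy input: 8 fine-level steps, 4 coarse-level steps, 16 channels.
rng = np.random.default_rng(0)
fine = rng.standard_normal((8, 16))
coarse = rng.standard_normal((4, 16))
fused, weights = cross_granularity_attention(fine, coarse)
```

The fused output keeps the temporal resolution of the fine level, which is what boundary localization needs, while each step also carries a weighted summary of the coarse level.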
computer science, artificial intelligence, interdisciplinary applications