Cross Time-Frequency Transformer for Temporal Action Localization

Jin Yang, Ping Wei, Nanning Zheng
DOI: https://doi.org/10.1109/tcsvt.2023.3326692
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Most modern approaches to temporal action localization (TAL) focus on time-domain information and neglect the advantages of information from other domains. Effectively exploiting information from different domains, together with their interactions, remains an attractive yet challenging problem in TAL. In this paper, we propose a novel cross time-frequency Transformer model (TFFormer) for TAL. A dual-branch network architecture is designed to capture time and frequency features at multiple scales, using a multi-scale Transformer in the time branch and the DB1 Discrete Wavelet Transform (DWT) in the frequency branch. To fuse the features from the two domains, we propose a cross time-frequency attention mechanism that includes a time pathway and a frequency pathway, enhancing the interaction between temporal and frequency features. Furthermore, a gated control mechanism is designed to aggregate features from different scales, characterizing the respective contribution of each scale. We also design a new regression loss function for locating temporal boundaries. Extensive experiments were carried out on four challenging benchmark datasets: two third-person datasets and two first-person datasets. TFFormer achieves an average mAP of 23.2% on Ego4D and 25.6% on EPIC-Kitchens 100, outperforming previous state-of-the-art methods by a large margin. It also obtains competitive results on ActivityNet v1.3 and THUMOS14, with average mAPs of 36.2% and 67.8%, respectively. Extensive ablation studies validate the effectiveness of each component of the proposed method.
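To make the frequency branch more concrete, the following is a minimal sketch of a multi-level DB1 (Haar) DWT applied along the temporal axis of a single feature channel. It is an illustration only, not the authors' implementation: the function names `haar_dwt_1d` and `multiscale_haar`, the choice to pad odd-length inputs by repeating the last sample, and the toy input are all assumptions made for this example.

```python
import math


def haar_dwt_1d(signal):
    """One level of the DB1 (Haar) DWT on a 1-D sequence.

    Returns (approximation, detail), each half the (padded) input length.
    """
    signal = list(signal)
    if len(signal) % 2 != 0:
        # Assumption for this sketch: pad odd lengths by repeating the last sample.
        signal.append(signal[-1])
    scale = 1.0 / math.sqrt(2.0)
    # Low-pass: scaled pairwise sums. High-pass: scaled pairwise differences.
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


def multiscale_haar(signal, levels):
    """Apply the Haar DWT repeatedly to the approximation coefficients.

    Returns a list of (approximation, detail) pairs, one per scale, so the
    temporal resolution halves at every level.
    """
    pyramid = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_dwt_1d(current)
        pyramid.append((current, detail))
    return pyramid


# Toy example: one feature channel over 8 time steps (hypothetical values).
pyramid = multiscale_haar([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], levels=3)
lengths = [len(approx) for approx, _ in pyramid]  # temporal lengths 4, 2, 1
```

Each level halves the temporal length, which yields a feature pyramid analogous to the multiple scales mentioned in the abstract. How the paper combines these coefficients with the time branch is not specified here and is not modeled by this sketch.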