TwinFormer: Fine-to-Coarse Temporal Modeling for Long-term Action Recognition

Jiaming Zhou, Kun-Yu Lin, Yu-Kun Qiu, Wei-Shi Zheng
DOI: https://doi.org/10.1109/tmm.2023.3302471
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: A long-term action in untrimmed video generally contains multiple sub-actions, among which various semantic patterns exist (e.g., co-occurrence or sequentiality between sub-actions). These semantic patterns are temporally coarse and are correlated with multiple local contexts, which encode the local temporal evolution of visual elements (e.g., hands, objects) in videos. The local contexts and semantic patterns form the inherent fine-to-coarse temporal structure of long-term actions, which is neglected by existing works. Accordingly, in this work we propose TwinFormer, which uses a novel fine-to-coarse temporal modeling scheme to uncover the temporal structure of long-term actions. TwinFormer consists of a pair of twin encoders with the same structural design, namely the Local-context Encoder and the Semantic-pattern Encoder, and a Temporal-bridged Attention that bridges the two encoders. The Local-context Encoder models the local contexts in the long-term action, the Temporal-bridged Attention correlates the local contexts with semantic patterns, and the Semantic-pattern Encoder reveals the temporal evolution of semantic patterns. Experimental results on three benchmarks demonstrate the effectiveness of the proposed model.
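The fine-to-coarse flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify layer details, so the fixed window size, the mean-pooled bridge query, the single attention head, and the absence of learned projections are all assumptions made here for illustration, and the function names (`attention`, `fine_to_coarse`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Plain scaled dot-product attention.
    # q: (..., Nq, D); k and v: (..., Nk, D). No learned weights (assumption).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def fine_to_coarse(frames, window):
    # frames: (T, D) per-frame features; T must be divisible by window.
    T, D = frames.shape
    assert T % window == 0, "T must be a multiple of the window size"
    segs = frames.reshape(T // window, window, D)  # (S, w, D)

    # Fine stage (stand-in for a local-context encoder):
    # self-attention restricted to each short window, with a residual.
    local = segs + attention(segs, segs, segs)  # (S, w, D)

    # Bridge (stand-in for a bridging attention): one query per window,
    # here simply the window mean (assumption), attends to that window's
    # local contexts and yields a single coarse token per window.
    queries = local.mean(axis=1, keepdims=True)  # (S, 1, D)
    coarse = attention(queries, local, local)[:, 0, :]  # (S, D)

    # Coarse stage (stand-in for a semantic-pattern encoder):
    # self-attention across the sequence of coarse tokens.
    semantic = coarse + attention(coarse, coarse, coarse)  # (S, D)
    return local, semantic


rng = np.random.default_rng(0)
frames = rng.standard_normal((32, 16))  # 32 frames, 16-dim features
local, semantic = fine_to_coarse(frames, window=8)
print(local.shape, semantic.shape)  # (4, 8, 16) (4, 16)
```

In this sketch the two attention stages share the same form, loosely mirroring the "twin" design, while the bridge step is what reduces temporal resolution from frames to segments. A classifier would typically pool `semantic` over its first axis before predicting an action label.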