CTM: Cross-time temporal module for fine-grained action recognition
Huifang Qian, Jialun Zhang, Jianping Yi, Zhenyu Shi, Yimin Zhang
DOI: https://doi.org/10.1016/j.cviu.2024.104013
IF: 4.886
2024-04-13
Computer Vision and Image Understanding
Abstract: Dynamic contextual attribute information along the temporal dimension is key to fine-grained action recognition. Conventional 2D CNNs cannot capture temporal contextual relationships; 3D CNNs model local temporal information well, but they are computationally intensive and lack the capacity to model global temporal information. This article proposes a parallel cross-time temporal module (CTM), which aims to efficiently capture dynamic contextual information at both the local and global temporal scales. Based on our study, we argue that 2D CNNs can better mine temporal features to enrich contextual relationships along the temporal dimension. The CTM can be embedded into any existing 2D CNN baseline in a plug-and-play manner, yielding a framework (CTNet) capable of complex spatio-temporal modeling at a small additional computational cost. In extensive validation experiments on three datasets (i.e., SomethingV1&V2, Jester, Diving48), embedding the CTM into a 2D CNN framework to enhance the baseline gives both higher action recognition accuracy and faster runtime inference than existing temporal-context optimization schemes of similar computational cost.
computer science, artificial intelligence; engineering, electrical & electronic