Sparse Transformer-Based Algorithm for Long-Short Temporal Association Action Recognition

Yue Lu, Yingyun Yang
DOI: https://doi.org/10.1109/cost60524.2023.00027
2023-01-01
Abstract: To recognize actions that span long time intervals and to model the global temporal information of videos, this paper combines 3D Convolutional Neural Networks (3DCNN) with a Transformer and proposes a sparse Transformer-based long-short temporal association action recognition algorithm. The algorithm uses a pre-trained model to extract clip features, embeds a video feature clustering module to reduce potential noise in the input features, and uses a Transformer long-short temporal association module based on sparse self-attention. This module applies a sparse mask matrix to the similarity matrix to suppress smaller attention weights, selectively retain important long-short temporal information, and concentrate the model's attention on global contextual information. Experimental results show that the model achieves Top-1 accuracy of 97.41% on UCF101 and 78.79% on HMDB51 with a small number of parameters and low computational complexity.
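The masking step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the sparse mask matrix is built, so this example assumes a per-query top-k rule: for each clip, only the `keep` largest similarity scores survive and the rest are masked out before the softmax. The function name `sparse_self_attention`, the parameter `keep`, the single-head layout, and the projection matrices are all hypothetical and are not taken from the paper.

```python
import numpy as np

def sparse_self_attention(x, w_q, w_k, w_v, keep):
    """Single-head self-attention over T clip features with top-k masking.

    ASSUMPTION: the top-k rule is one plausible way to build a sparse mask;
    the paper's actual mask construction is not given in the abstract.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # (T, T) similarity matrix between every pair of clips
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # The keep-th largest score in each row acts as that row's threshold
    thresh = np.partition(scores, -keep, axis=-1)[:, -keep][:, None]
    mask = scores >= thresh
    # Masked entries become -inf, so they receive exactly zero weight
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
clips = rng.standard_normal((8, 16))  # 8 clip features of dimension 16
w_q, w_k, w_v = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
out, weights = sparse_self_attention(clips, w_q, w_k, w_v, keep=3)
```

Each row of `weights` still sums to 1, but its mass is spread over only `keep` clips, which is one way to realize the "suppress smaller attention weights" behaviour the abstract describes.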