Efficient Action Recognition with Introducing R(2+1)D Convolution to Improved Transformer

Hao Jin, Jianming Yang, Sheng Zhang
DOI: https://doi.org/10.1109/icicsp54369.2021.9611970
2021-01-01
Abstract: The mainstream methods in video action recognition include 3D convolutional neural networks (CNNs) and two-stream networks. Recently, the Transformer architecture, which excels at sequence modeling, has made a breakthrough in Natural Language Processing (NLP) tasks. In this paper, we propose a simple but effective approach to improve action recognition performance. First, we design an improved Transformer to better model dependencies and reduce feature representation error. Then we employ an R(2+1)D convolution to capture the low-level local spatiotemporal features of clips split from the whole input video sequence. Next, we stack all the feature maps into a latent time-domain sequence, which is fed into the improved Transformer to obtain global attention and model high-level temporal information. Finally, a method similar to channel attention dynamically updates the weights along the latent time-domain dimension of the sequence, which better maps the features to class labels. Experimental results show that our method outperforms state-of-the-art works.
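To make the clip-splitting and R(2+1)D steps more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The non-overlapping clip scheme, the function names, and the example sizes (32 frames, 8-frame clips, 64 channels, 3x3x3 kernels) are all assumptions for illustration. The intermediate-channel formula comes from the original R(2+1)D paper by Tran et al. (2018), not from this abstract; it picks the width of the 2D-spatial-then-1D-temporal pair so that its weight count roughly matches a full 3D convolution.

```python
def split_into_clips(num_frames, clip_len):
    """Return (start, end) frame ranges for consecutive clips.

    Assumption: clips do not overlap and leftover frames are dropped;
    the abstract does not specify the splitting scheme.
    """
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, clip_len)]


def conv3d_params(c_in, c_out, t=3, d=3):
    """Weight count of a full t x d x d 3D convolution (bias ignored)."""
    return c_in * c_out * t * d * d


def r2plus1d_mid_channels(c_in, c_out, t=3, d=3):
    """Intermediate channel count M from Tran et al. (2018):
    M = floor(t * d^2 * c_in * c_out / (d^2 * c_in + t * c_out)).
    """
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


def r2plus1d_params(c_in, c_out, t=3, d=3):
    """Weight count of the factorized block: a 1 x d x d spatial
    convolution (c_in -> M) followed by a t x 1 x 1 temporal
    convolution (M -> c_out), bias ignored.
    """
    m = r2plus1d_mid_channels(c_in, c_out, t, d)
    spatial = c_in * m * d * d
    temporal = m * c_out * t
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    print("clips:", split_into_clips(num_frames=32, clip_len=8))
    print("mid channels:", r2plus1d_mid_channels(64, 64))
    print("3D conv weights:", conv3d_params(64, 64))
    print("R(2+1)D weights:", r2plus1d_params(64, 64))
```

Under these assumed sizes the factorized block has the same number of weights as the full 3D convolution, while inserting an extra nonlinearity between the spatial and temporal steps. In the pipeline the abstract describes, each clip would pass through such a block, and the per-clip feature maps would then be stacked into the sequence that the improved Transformer consumes.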