Human Action Recognition Based on Improved Fusion Attention CNN and RNN

Han Zhao, Xinyu Jin
DOI: https://doi.org/10.1109/iccia49625.2020.00028
2020-01-01
Abstract: Attention-based models are widely used in computer vision and natural language processing, and action recognition in videos is no exception. In this paper, we develop a novel convolutional and recurrent network for action recognition that is "doubly deep" in its spatial and temporal layers. First, in the feature extraction stage, we propose an improved p-non-local operation as a simple and effective component for capturing long-distance dependencies in deep convolutional neural networks. Second, in the class prediction stage, we propose Fusion KeyLess Attention, combined with a bidirectional (forward and backward) LSTM, to learn the sequential nature of the data more efficiently and elegantly; it uses multi-epoch model fusion based on the confusion matrix. Experiments on two heterogeneous datasets, HMDB51 and Hollywood2, show that our model has distinct advantages over traditional CNN- and RNN-based action recognition models that likewise use only RGB features.
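To make the class prediction stage more concrete, the following is a minimal sketch of generic keyless attention pooling over a sequence of per-frame features, such as the outputs of a bidirectional LSTM. It is not the authors' Fusion KeyLess Attention: the function name `keyless_attention_pool`, the scoring vector `w`, and the toy inputs are assumptions for illustration, and the multi-epoch model fusion based on the confusion matrix is not shown.

```python
import math

def keyless_attention_pool(features, w):
    """Pool T feature vectors into one vector using keyless attention.

    features: list of T vectors (each a list of D floats), standing in for
              per-frame BiLSTM outputs (an assumption for illustration).
    w:        list of D floats, a hypothetical learned scoring vector.
    """
    # Score each time step from its own feature only; no query or key is used.
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in features]
    # Numerically stable softmax over the time steps.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the feature vectors.
    dim = len(features[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, features)) for d in range(dim)]
    return pooled, weights

# Toy example: three time steps with two-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, -0.5]
pooled, weights = keyless_attention_pool(features, w)
```

In this sketch the attention weights depend only on the features themselves, which is what distinguishes keyless attention from query-based attention; the pooled vector would then be passed to a classifier.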