Human action recognition using attention based LSTM network with dilated CNN features
Khan Muhammad, Mustaqeem, Amin Ullah, Ali Shariq Imran, Muhammad Sajjad, Mustafa Servet Kiran, Giovanna Sannino, Victor Hugo C. de Albuquerque
DOI: https://doi.org/10.1016/j.future.2021.06.045
IF: 7.307
2021-12-01
Future Generation Computer Systems
Abstract: Human action recognition in videos is an active area of research in computer vision and pattern recognition. Artificial intelligence (AI) based systems are increasingly needed for human-behavior assessment and security purposes. Existing action recognition techniques mainly rely on pre-trained weights of different AI architectures for the visual representation of video frames in the training stage, which affects how well discrepancies between features can be determined, such as the distinction between visual and temporal cues. To address this issue, we propose a bi-directional long short-term memory (BiLSTM) based attention mechanism with a dilated convolutional neural network (DCNN) that selectively focuses on effective features in the input frame to recognize different human actions in videos. In this network, the DCNN layers extract salient discriminative features, and residual blocks upgrade these features so that they keep more information than those of a shallow layer. We then feed these features into the BiLSTM to learn long-term dependencies, followed by the attention mechanism to boost performance and extract additional high-level, selective action-related patterns and cues. We further combine the center loss with Softmax to improve the loss function, which achieves higher performance in video-based action classification. The proposed system is evaluated on three benchmarks, i.e., the UCF11, UCF Sports, and J-HMDB datasets, on which it achieves recognition rates of 98.3%, 99.1%, and 80.2%, respectively, a 1%–3% improvement over the state-of-the-art baselines on each dataset.
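The abstract names two components that are easy to illustrate in isolation: attention over per-frame recurrent features, and a Softmax loss augmented with a center loss. The following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes dot-product attention with a single query vector and the commonly used center-loss form (cross-entropy plus a weighted squared distance to the class center). The function names, the `query` vector, and the weight `lam` are hypothetical; the abstract does not specify any of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Pool T per-frame vectors into one context vector.

    hidden_states: list of T vectors of length D (stand-ins for BiLSTM outputs).
    query: hypothetical learned vector of length D (an assumption of this sketch).
    Returns (context, weights), where weights sum to 1 over the T time steps.
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(len(query))
    ]
    return context, weights


def softmax_center_loss(logits, feature, label, centers, lam=0.5):
    """Cross-entropy plus lam * 0.5 * ||feature - centers[label]||^2.

    lam is an assumed trade-off weight; the abstract gives no value.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[label])
    center_term = 0.5 * sum((f - c) ** 2 for f, c in zip(feature, centers[label]))
    return cross_entropy + lam * center_term


# Toy example: three frames with 2-D features, two action classes.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attention_pool(hidden, query=[1.0, 1.0])
loss = softmax_center_loss(
    logits=[2.0, 0.0],
    feature=context,
    label=0,
    centers=[[1.0, 1.0], [0.0, 0.0]],
)
```

In this toy run, the third frame receives the largest attention weight because it aligns best with the query. The center term pulls the pooled feature toward the center of its own class, which is the intuition behind pairing the center loss with Softmax.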