Abstract: Human action recognition in videos is an active area of research in computer vision and pattern recognition. Artificial intelligence (AI) based systems are increasingly needed for human-behavior assessment and security purposes. Existing action recognition techniques mainly rely on the pre-trained weights of different AI architectures to represent video frames visually in the training stage, which impairs their ability to discriminate between features, such as distinguishing visual cues from temporal cues. To address this issue, we propose a bi-directional long short-term memory (BiLSTM) based attention mechanism with a dilated convolutional neural network (DCNN) that selectively focuses on effective features in the input frame to recognize different human actions in videos. In this network, the DCNN layers extract salient discriminative features, with residual blocks enhancing the features so that they retain more information than those of a shallow layer. We then feed these features into a BiLSTM to learn long-term dependencies, followed by the attention mechanism, which boosts performance and extracts additional high-level, selective, action-related patterns and cues. We further combine center loss with Softmax to improve the loss function, which yields higher performance in video-based action classification. The proposed system is evaluated on three benchmarks, namely the UCF11, UCF Sports, and J-HMDB datasets, on which it achieves recognition rates of 98.3%, 99.1%, and 80.2%, respectively, a 1%–3% improvement over the state-of-the-art baselines on each dataset.
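
The abstract states that center loss is combined with Softmax but gives no formula. The sketch below illustrates one common way such a joint objective is written: softmax cross-entropy plus a weighted center-loss term that pulls each feature vector toward the center of its class. It is a minimal illustration, not the authors' implementation; the function names, the weight `lam`, and its default value are assumptions introduced here, and the class centers are taken as given rather than learned.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` for the true class `label`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def joint_loss(features, logits, labels, centers, lam=0.01):
    """Batch mean of softmax loss + lam * center loss.

    `lam` is a hypothetical balancing weight; the abstract does not give one.
    `centers` maps each class label to its (assumed fixed) center vector.
    """
    total = 0.0
    for x, z, y in zip(features, logits, labels):
        total += softmax_cross_entropy(z, y) + lam * center_loss(x, centers[y])
    return total / len(labels)
```

In a full training loop the centers would typically be updated alongside the network weights; that step is omitted here to keep the sketch short.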

Human Action Recognition Based on Improved Fusion Attention CNN and RNN

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Human Action Recognition in Videos using Convolution Long Short-Term Memory Network with Spatio-Temporal Networks

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Joint Network based Attention for Action Recognition

Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition

Human Action Recognition From Digital Videos Based on Deep Learning

Action Recognition with Joint Attention on Multi-Level Deep Features

Human Action Recognition Using Deep Learning Methods

Human Action Recognition Combining Sequential Dynamic Images and Two-Stream Convolutional Network

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Multi-modality Fusion Network for Action Recognition

Human action recognition using attention based LSTM network with dilated CNN features

Towards Improved Human Action Recognition Using Convolutional Neural Networks and Multimodal Fusion of Depth and Inertial Sensor Data

Multi-scale residual network model combined with Global Average Pooling for action recognition

Hierarchical Multi-scale Attention Networks for Action Recognition

Human-centric multimodal fusion network for robust action recognition

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization