Abstract: Vision-based human activity recognition has become a crucial research field in video analytics. Over the last decade, numerous deep learning algorithms have been developed to accurately detect complex human actions in video streams. Although these algorithms achieve impressive recognition performance, they typically favor either model accuracy or computational efficiency at the expense of the other. This lopsided trade-off between robustness and efficiency makes them hard to apply to complex human activity recognition problems. To address this issue, this paper presents a computationally efficient yet robust approach that exploits saliency-aware spatial and temporal features for human action recognition in videos. To represent human actions effectively, we propose the dual-attentional Residual 3D Convolutional Neural Network (DA-R3DCNN). The method uses a unified channel-spatial attention mechanism to efficiently extract salient human-centric features from video frames. Combining dual channel-spatial attention layers with residual 3D convolution layers makes the network more discerning in capturing the spatial receptive fields that contain objects within the feature maps. To assess the effectiveness and robustness of the proposed method, we conducted extensive experiments on four well-established benchmark datasets for human action recognition. The quantitative results validate the effectiveness of the method, showing accuracy improvements of up to 11% over state-of-the-art human action recognition methods. In addition, our inference-time evaluation shows that the proposed method achieves up to a 74× improvement in frames per second (FPS) over existing approaches, demonstrating the suitability of DA-R3DCNN for real-time human activity recognition.
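
To make the idea of channel-spatial attention with a residual connection concrete, the following is a minimal illustrative sketch and not the DA-R3DCNN implementation. It assumes a feature map stored as nested lists of shape [C][T][H][W] and uses parameter-free sigmoid gates computed from mean activations. An actual network would presumably use learned layers for both gates and a 3D convolution in the residual branch, neither of which is specified in the abstract. All function names are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic function used as a soft gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(x):
    """Scale each channel by a gate derived from its global mean.

    x has shape [C][T][H][W]. The gate here is parameter-free (an
    assumption for illustration); a real model would learn it.
    """
    out = []
    for channel in x:
        values = [v for frame in channel for row in frame for v in row]
        gate = sigmoid(sum(values) / len(values))
        out.append([[[v * gate for v in row] for row in frame] for frame in channel])
    return out


def spatial_attention(x):
    """Scale each (t, h, w) position by a gate derived from the channel mean."""
    C, T, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    out = [[[[0.0] * W for _ in range(H)] for _ in range(T)] for _ in range(C)]
    for t in range(T):
        for h in range(H):
            for w in range(W):
                gate = sigmoid(sum(x[c][t][h][w] for c in range(C)) / C)
                for c in range(C):
                    out[c][t][h][w] = x[c][t][h][w] * gate
    return out


def dual_attention_residual(x):
    """Apply channel then spatial attention and add the input back (identity skip).

    The 3D convolution of a real residual block is omitted here.
    """
    attended = spatial_attention(channel_attention(x))
    return [
        [
            [[a + b for a, b in zip(row_x, row_a)] for row_x, row_a in zip(frame_x, frame_a)]
            for frame_x, frame_a in zip(chan_x, chan_a)
        ]
        for chan_x, chan_a in zip(x, attended)
    ]


# Tiny example: 2 channels, 1 frame, 1x2 spatial grid.
features = [[[[1.0, 2.0]]], [[[3.0, 4.0]]]]
refined = dual_attention_residual(features)
```

Because both gates lie strictly between 0 and 1, each positive activation in this sketch ends up between its original value and twice that value: the skip connection preserves the input while the attention branch adds a reweighted copy.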

Human Action Representation Learning Using an Attention-Driven Residual 3DCNN Network