Abstract: Vision-based human activity recognition has become a crucial research field in video analytics. Over the last decade, numerous deep learning algorithms have been proposed to accurately detect complex human actions in video streams. Although these algorithms achieve impressive recognition performance, they tend to favor either model performance or computational efficiency at the expense of the other. This lopsided trade-off between robustness and efficiency poses challenges for complex human activity recognition problems. To address this issue, this paper presents a computationally efficient yet robust approach that exploits saliency-aware spatial and temporal features for human action recognition in videos. To represent human actions effectively, we propose the dual-attentional Residual 3D Convolutional Neural Network (DA-R3DCNN). The proposed method uses a unified channel-spatial attention mechanism that allows it to efficiently extract significant human-centric features from video frames. By combining dual channel-spatial attention layers with residual 3D convolution layers, the network becomes more discerning in capturing spatial receptive fields that contain objects within the feature maps. To assess the effectiveness and robustness of the proposed method, we conducted extensive experiments on four well-established benchmark datasets for human action recognition. The quantitative results validate the method, showing accuracy improvements of up to 11% over state-of-the-art human action recognition methods. In addition, our inference-time evaluation shows that the proposed method achieves up to a 74× improvement in frames per second (FPS) compared to existing approaches, demonstrating the suitability of DA-R3DCNN for real-time human activity recognition.
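
The idea of gating a spatio-temporal feature map first per channel and then per location, with a residual skip connection around the gates, can be illustrated with a minimal NumPy sketch. This is not the DA-R3DCNN implementation: the abstract does not specify layer shapes or how the attention weights are computed, so the bottleneck matrices `w1` and `w2`, the reduction ratio `r`, the tensor sizes, and the simple sum of average- and max-pooled maps (standing in for a learned convolution) are all assumptions made for illustration. The learned 3D convolutions are omitted entirely.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """feat: (C, T, H, W). Returns one gate per channel, shape (C, 1, 1, 1)."""
    pooled = feat.mean(axis=(1, 2, 3))       # global average pool -> (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)    # hypothetical bottleneck + ReLU -> (C // r,)
    return sigmoid(w2 @ hidden).reshape(-1, 1, 1, 1)


def spatial_attention(feat):
    """Returns one gate per (t, h, w) location, shape (1, T, H, W)."""
    avg = feat.mean(axis=0, keepdims=True)   # average over channels
    mx = feat.max(axis=0, keepdims=True)     # max over channels
    # Assumption: a learned convolution would normally mix avg and mx;
    # here they are simply summed to keep the sketch short.
    return sigmoid(avg + mx)


def dual_attention_residual(feat, w1, w2):
    """Channel gate, then spatial gate, then residual skip connection."""
    gated = feat * channel_attention(feat, w1, w2)
    gated = gated * spatial_attention(gated)
    return feat + gated


# Illustrative sizes only (not taken from the paper).
rng = np.random.default_rng(0)
C, T, H, W, r = 8, 4, 6, 6, 2
feat = rng.standard_normal((C, T, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

out = dual_attention_residual(feat, w1, w2)
print(out.shape)
```

Because both gates lie between 0 and 1, the skip connection guarantees that every input activation passes through at least unchanged, while locations and channels with high attention are amplified by up to a factor of two.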

Part-wise Spatio-temporal Attention Driven CNN-based 3D Human Action Recognition

Skeleton-Based Human Action Recognition Using Spatial Temporal 3D Convolutional Neural Networks

Learning to Recognize 3D Human Action from A New Skeleton-based Representation Using Deep Convolutional Neural Networks

A Fine-to-Coarse Convolutional Neural Network for 3D Human Action Recognition

Exploiting deep residual networks for human action recognition from skeletal data

Human Action Representation Learning Using an Attention-Driven Residual 3DCNN Network

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Skeleton-based Action Recognition Using LSTM and CNN

Deep learning-based multi-view 3D-human action recognition using skeleton and depth data

Two-Stream 3D Convolutional Neural Network for Skeleton-Based Action Recognition

Spatio-Temporal Attention Deep Network for Skeleton Based View-Invariant Human Action Recognition

End-to-end Learning of Deep Convolutional Neural Network for 3D Human Action Recognition

3D Action Recognition Using Multi-Temporal Skeleton Visualization

3D Action Recognition Using Data Visualization and Convolutional Neural Networks

Deep spatiotemporal LSTM network with temporal pattern feature for 3D human action recognition

Accurate And Real-Time Human Action Recognition Based On 3d Skeleton

Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection

Unified Spatio-Temporal Attention Models for Advanced Human Action Recognition & Detection

Spatio–Temporal Image Representation of 3D Skeletal Movements for View-Invariant Action Recognition with Deep Convolutional Neural Networks

An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data

Spatio-temporal attention on manifold space for 3D human action recognition