Abstract: Vision-based human activity recognition has become a crucial research field in video analytics. Over the last decade, numerous deep learning algorithms have been developed to accurately detect complex human actions in video streams. While these algorithms have demonstrated impressive performance in activity recognition, they often favor either model performance or computational efficiency at the expense of the other. This lopsided trade-off between robustness and efficiency poses challenges when addressing complex human activity recognition problems. To address this issue, this paper presents a computationally efficient yet robust approach that exploits saliency-aware spatial and temporal features for human action recognition in videos. To represent human actions effectively, we propose the dual-attentional Residual 3D Convolutional Neural Network (DA-R3DCNN). The proposed method uses a unified channel-spatial attention mechanism that allows it to efficiently extract significant human-centric features from video frames. By combining dual channel-spatial attention layers with residual 3D convolution layers, the network becomes more discerning in capturing spatial receptive fields containing objects within the feature maps. To assess the effectiveness and robustness of the proposed method, we conducted extensive experiments on four well-established benchmark datasets for human action recognition. The quantitative results validate the effectiveness of our method, showing accuracy improvements of up to 11% over state-of-the-art human action recognition methods. Additionally, our evaluation of inference time reveals that the proposed method achieves up to a 74× improvement in frames per second (FPS) compared to existing approaches, demonstrating the suitability of the proposed DA-R3DCNN for real-time human activity recognition.
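
To make the idea of channel-spatial attention combined with a residual connection more concrete, the following is a minimal NumPy sketch. It is not the authors' DA-R3DCNN: the abstract does not specify the layer design, so the sketch assumes a generic formulation in which pooled channel statistics pass through a small two-layer perceptron, and pooled per-location statistics pass through a sigmoid. The function names, the weight matrices `w1` and `w2`, and the scalar mixing weights `a` and `b` (standing in for a learned convolution) are all hypothetical, and the 3D convolutions themselves are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Per-channel weights for a (C, T, H, W) feature map; returns (C, 1, 1, 1)."""
    avg = feat.mean(axis=(1, 2, 3))  # average-pooled descriptor, shape (C,)
    mx = feat.max(axis=(1, 2, 3))    # max-pooled descriptor, shape (C,)

    def mlp(v):
        # Hypothetical shared two-layer perceptron with a ReLU in between.
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx)).reshape(-1, 1, 1, 1)

def spatial_attention(feat, a=1.0, b=1.0):
    """Per-location weights; returns (1, T, H, W).

    a and b are hypothetical scalars standing in for a learned convolution
    over the pooled maps.
    """
    avg = feat.mean(axis=0, keepdims=True)  # pool across channels
    mx = feat.max(axis=0, keepdims=True)
    return sigmoid(a * avg + b * mx)

def dual_attention_residual(feat, w1, w2):
    """Apply channel then spatial attention, then add the input back."""
    refined = feat * channel_attention(feat, w1, w2)
    refined = refined * spatial_attention(refined)
    return feat + refined  # residual (skip) connection

# Toy input: 8 channels, 4 frames, 6x6 spatial resolution.
rng = np.random.default_rng(0)
C, T, H, W = 8, 4, 6, 6
feat = rng.standard_normal((C, T, H, W))
w1 = rng.standard_normal((C // 2, C)) * 0.1
w2 = rng.standard_normal((C, C // 2)) * 0.1
out = dual_attention_residual(feat, w1, w2)
```

In this sketch both attention maps lie strictly between 0 and 1, so the attention branch can only rescale features, while the skip connection keeps the original signal available to later layers.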
