Abstract: This paper presents a boosted multi‐stream CNN‐based model that leverages various attention modules to analyze multimodal data obtained from video cameras and Kinect sensors for human action recognition. A novel motion encoding technique, orientation‐magnitude response maps, is introduced to capture the historical motion of actions by representing a long frame sequence as a single 2D motion history image, from which motion patterns can be explored. The model incorporates a convolutional self‐attention autoencoder to represent compressed, high‐level motion features. Sequential convolutional self‐attention modules are used to exploit the implicit relationships within motion patterns. Furthermore, a 2D discrete wavelet transform is employed to decompose RGB frames into discriminative coefficients, providing supplementary spatial information related to the actors' actions. A spatial attention block, implemented through the weighted inception module in a CNN‐based structure, is designed to weigh the multi‐scale neighbours of various image patches. Moreover, local and global body pose features are combined by extracting informative joints based on geometry features and joint trajectories in 3D space. To weight the importance of specific channels in the pose descriptors, a multi‐scale channel attention module is proposed. For each data modality, a boosted CNN‐based model is designed, and the action predictions from the different streams are integrated. The proposed model is evaluated on multiple datasets, including HMDB51, UTD‐MHAD, and MSR Daily Activity, demonstrating its potential for action recognition.
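
As an illustration of the wavelet step mentioned in the abstract, the following is a minimal sketch of a single‐level 2D discrete wavelet transform applied to one grayscale frame. It assumes the Haar wavelet and a frame with even height and width; the abstract does not state which wavelet or how many decomposition levels the authors use, and the function name `haar_dwt2` is hypothetical. It is written in plain Python so that it runs without extra libraries.

```python
def haar_dwt2(frame):
    """Single-level 2D Haar transform of a grayscale frame.

    `frame` is a list of equal-length rows of numbers with even height
    and width (an assumption of this sketch). Returns four sub-bands
    (ll, lh, hl, hh), each half the height and width of the input.
    """
    rows, cols = len(frame), len(frame[0])
    if rows % 2 or cols % 2:
        raise ValueError("frame height and width must be even")

    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            # One 2x2 block: top-left, top-right, bottom-left, bottom-right.
            tl, tr = frame[r][c], frame[r][c + 1]
            bl, br = frame[r + 1][c], frame[r + 1][c + 1]
            ll_row.append((tl + tr + bl + br) / 2)  # scaled local sum (approximation)
            lh_row.append((tl + tr - bl - br) / 2)  # top row minus bottom row
            hl_row.append((tl - tr + bl - br) / 2)  # left column minus right column
            hh_row.append((tl - tr - bl + br) / 2)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


# Usage on a tiny 2x2 "frame".
ll, lh, hl, hh = haar_dwt2([[1, 2], [3, 4]])
print(ll, lh, hl, hh)
```

For an RGB frame, the same function could be applied to each colour channel separately; how the resulting sub‐bands are fed to the spatial stream is not specified in the abstract.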

Spatio‐temporal attention modules in orientation‐magnitude‐response guided multi‐stream CNNs for human action recognition
