Abstract: This paper presents a multi‐stream CNN‐based model that leverages various attention modules to analyze multimodal data obtained from video cameras and Kinect sensors. A novel motion descriptor, orientation‐magnitude response maps, is introduced to capture the historical motion of actions, representing a long frame sequence as a single 2D motion history image from which motion patterns can be explored. A boosted multi‐stream CNN‐based model with various attention modules is then designed for human action recognition. The model incorporates a convolutional self‐attention autoencoder to represent compressed, high‐level motion features. Sequential convolutional self‐attention modules are used to exploit the implicit relationships within motion patterns. Furthermore, a 2D discrete wavelet transform is employed to decompose RGB frames into discriminative coefficients, providing supplementary spatial information related to the actors' actions. A spatial attention block, implemented through a weighted inception module in a CNN‐based structure, is designed to weigh the multi‐scale neighbours of various image patches. Moreover, local and global body pose features are combined by extracting informative joints based on geometry features and joint trajectories in 3D space. To weight the importance of specific channels in the pose descriptors, a multi‐scale channel attention module is proposed. For each data modality, a boosted CNN‐based model is designed, and the action predictions from the different streams are integrated. The effectiveness of the proposed model is evaluated on multiple datasets, including HMDB51, UTD‐MHAD, and MSR‐daily activity, showcasing its potential in the field of action recognition.
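To make the wavelet decomposition step concrete, the following is a minimal sketch of a single‐level 2D discrete wavelet transform on one image channel. The abstract does not state which wavelet family or decomposition depth is used, so the Haar wavelet and the single level here are assumptions, and the function name `haar_dwt2` and its sub‐band names are hypothetical, not taken from the paper.

```python
def haar_dwt2(image):
    """Single-level 2D Haar DWT of a 2D list with even height and width.

    Assumption: the Haar wavelet is used only for illustration; the paper's
    abstract does not specify the wavelet family.
    Returns four sub-bands, each half the input size in both dimensions.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")

    approx, row_diff, col_diff, diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, r_row, c_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            # The 2x2 block of pixels handled in this step.
            tl, tr = image[r][c], image[r][c + 1]
            bl, br = image[r + 1][c], image[r + 1][c + 1]
            # Orthonormal Haar: each coefficient is a signed sum divided by 2.
            a_row.append((tl + tr + bl + br) / 2)  # low-pass approximation
            r_row.append((tl + tr - bl - br) / 2)  # top rows minus bottom rows
            c_row.append((tl - tr + bl - br) / 2)  # left columns minus right columns
            d_row.append((tl - tr - bl + br) / 2)  # diagonal detail
        approx.append(a_row)
        row_diff.append(r_row)
        col_diff.append(c_row)
        diag.append(d_row)
    return approx, row_diff, col_diff, diag


if __name__ == "__main__":
    frame_channel = [[1, 2], [3, 4]]
    print(haar_dwt2(frame_channel))
```

For an RGB frame, one option under the same assumptions is to apply the function to each colour channel separately and stack the resulting sub‐bands as input channels for a spatial stream.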

Human Action Recognition: Pose-based Attention draws focus to Hands

Pose-conditioned Spatio-Temporal Attention for Human Action Recognition

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data

Unified Spatio-Temporal Attention Models for Advanced Human Action Recognition & Detection

Action Transformer: A Self-Attention Model for Short-Time Pose-Based Human Action Recognition

First-Person Hand Action Recognition Using Multimodal Data

An attention-aware model for human action recognition on tree-based skeleton sequences

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Spatio-temporal attention on manifold space for 3D human action recognition

Context-Aware Cross-Attention for Skeleton-Based Human Action Recognition

Attention-Based Pose Sequence Machine for 3D Hand Pose Estimation

Attention-based hand pose estimation with voting and dual modalities

Spatio‐temporal attention modules in orientation‐magnitude‐response guided multi‐stream CNNs for human action recognition

Spectral studies on metal-ligand bonding of novel rhodanine azodye sulphadrugs.

Human Action Recognition Based on Improved Fusion Attention CNN and RNN

Spatio-Temporal Attention Deep Network for Skeleton Based View-Invariant Human Action Recognition

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Temporal Attentive Network for Action Recognition