Abstract: This paper presents the ARN-LSTM architecture, a novel multi-stream action recognition model designed to address the challenge of simultaneously capturing spatial motion and temporal dynamics in action sequences. Traditional methods often focus solely on spatial or temporal features, limiting their ability to comprehend complex human activities fully. Our proposed model integrates joint, motion, and temporal information through a multi-stream fusion architecture. Specifically, it comprises a joint stream for extracting skeleton features, a temporal stream for capturing dynamic temporal features, and an ARN-LSTM block that utilizes Time-Distributed Long Short-Term Memory (TD-LSTM) layers followed by an Attention Relation Network (ARN) to model temporal relations. The outputs from these streams are fused in a fully connected layer to provide the final action prediction. Evaluations on the NTU RGB+D 60 and NTU RGB+D 120 datasets demonstrate the superior performance of our model, particularly in group activity recognition.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how, in skeleton-based action recognition, to capture spatial movement and temporal dynamics simultaneously, so that complex human activities can be understood more comprehensively. Traditional methods usually focus on only spatial or only temporal features, which limits that understanding.

### Specific description of the problem:

1. **Limitations of single features**: Traditional methods focus either on spatial features (such as body postures and joint positions) or on temporal features (such as the temporal evolution of actions). Processing a single feature type cannot fully capture complex activities that involve both movement and temporal patterns.
2. **Lack of interaction information**: Existing multi-person action recognition methods usually treat individuals as independent entities and ignore the interactions between them, especially in two-person interaction scenarios.
3. **Multi-scale spatio-temporal dependence**: Complex action sequences contain long-range cross-joint relationships. Existing methods struggle to model these relationships effectively, particularly when they must select discriminative frames and joints.

### Solution:

To solve the above problems, the paper proposes the ARN-LSTM architecture, a multi-stream fusion model that aims to improve action recognition by integrating joint, movement, and temporal information. Specifically:

- **Multi-stream architecture**: The model includes three main parts:
  - **Joint Stream**: Extracts skeleton features and captures spatial configurations.
  - **Temporal Stream**: Captures dynamic temporal features and models the temporal evolution of actions.
  - **ARN-LSTM block**: Uses Time-Distributed Long Short-Term Memory (TD-LSTM) layers and an Attention Relation Network (ARN) to model temporal relationships and strengthen the correlation between joint and temporal features.
- **Fusion mechanism**: Fuses the outputs of the different streams through a fully connected layer to generate the final action prediction.

### Main contributions:

1. Proposes a multi-stream architecture that fuses joint, movement, and temporal information to improve action recognition.
2. Introduces the Attention Relation Network (ARN), which captures and amplifies the mutual relationships between spatial and temporal features.
3. Reports state-of-the-art performance on the NTU RGB+D 60 and NTU RGB+D 120 datasets, with particularly strong results in group activity recognition.

Through these innovations, the ARN-LSTM model can more comprehensively understand and process complex human activity sequences, especially those involving multi-person interactions.
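The data flow summarized above (per-stream features, attention over time, fusion in a fully connected layer) can be illustrated with a minimal sketch. This is not the paper's implementation: the TD-LSTM layers and the ARN are replaced by a plain softmax attention pooling over frames, the motion input is assumed to be frame-to-frame joint displacement, and every name (`motion_features`, `attention_pool`, `predict`, `random_params`) and every weight is hypothetical and untrained.

```python
import math
import random


def motion_features(frames):
    """Frame-to-frame joint displacement (an assumed input for the temporal stream)."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(frames, frames[1:])]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Stand-in for the ARN-LSTM block: weight frames by softmax(frame . query), then sum."""
    scores = [sum(x * q for x, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * frame[i] for w, frame in zip(weights, frames)) for i in range(dim)]


def linear(x, weight, bias):
    """Fully connected layer; `weight` holds one row per output class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def predict(frames, params):
    """Pool each stream over time, concatenate, and classify."""
    joint_vec = attention_pool(frames, params["joint_query"])
    motion_vec = attention_pool(motion_features(frames), params["motion_query"])
    fused = joint_vec + motion_vec  # late fusion by list concatenation
    return softmax(linear(fused, params["fc_weight"], params["fc_bias"]))


def random_params(feat_dim, num_classes, seed=0):
    """Untrained placeholder weights, only to make the sketch runnable."""
    rng = random.Random(seed)

    def rand(n):
        return [rng.uniform(-0.1, 0.1) for _ in range(n)]

    return {
        "joint_query": rand(feat_dim),
        "motion_query": rand(feat_dim),
        "fc_weight": [rand(2 * feat_dim) for _ in range(num_classes)],
        "fc_bias": [0.0] * num_classes,
    }


# Toy input: 4 frames, each a flat list of 2 joints x 3 coordinates.
frames = [
    [0.0, 1.0, 2.0, 0.5, 0.5, 0.5],
    [0.1, 1.1, 2.0, 0.6, 0.5, 0.4],
    [0.3, 1.2, 2.1, 0.8, 0.5, 0.3],
    [0.6, 1.2, 2.3, 1.1, 0.6, 0.2],
]
probs = predict(frames, random_params(feat_dim=6, num_classes=3))
print(probs)  # one probability per (hypothetical) action class
```

Concatenating the pooled stream vectors before a single linear layer is the simplest form of late fusion. A trained model would learn the attention queries and classifier weights jointly, and would replace `attention_pool` with recurrent layers followed by the relation module.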

ARN-LSTM: A Multi-Stream Fusion Model for Skeleton-based Action Recognition

Fusing Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks

A Skeleton-Based Assembly Action Recognition Method with Feature Fusion for Human-Robot Collaborative Assembly

Multi-Modality Adaptive Feature Fusion Graph Convolutional Network for Skeleton-Based Action Recognition

Temporal Enhanced Multi-Stream Graph Convolutional Neural Networks For Skeleton-Based Action Recognition

Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition

Symmetrical Enhanced Fusion Network for Skeleton-Based Action Recognition

Relational Network for Skeleton-Based Action Recognition

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Attention-Based Multilevel Co-Occurrence Graph Convolutional LSTM for 3-D Action Recognition

An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition

Skeleton-based Action Recognition Using LSTM and CNN

Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Explorations of Skeleton Features for LSTM-based Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks

Temporal channel reconfiguration multi‐graph convolution network for skeleton‐based action recognition

Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition

Multi-Stream Interaction Networks for Human Action Recognition

Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks