ARN-LSTM: A Multi-Stream Fusion Model for Skeleton-based Action Recognition

Chuanchuan Wang, Ahmad Sufril Azlan Mohmamed, Mohd Halim Bin Mohd Noor, Xiao Yang, Feifan Yi, Xiang Li
2024-11-29
Abstract: This paper presents the ARN-LSTM architecture, a novel multi-stream action recognition model designed to address the challenge of simultaneously capturing spatial motion and temporal dynamics in action sequences. Traditional methods often focus solely on spatial or temporal features, limiting their ability to comprehend complex human activities fully. Our proposed model integrates joint, motion, and temporal information through a multi-stream fusion architecture. Specifically, it comprises a joint stream for extracting skeleton features, a temporal stream for capturing dynamic temporal features, and an ARN-LSTM block that utilizes Time-Distributed Long Short-Term Memory (TD-LSTM) layers followed by an Attention Relation Network (ARN) to model temporal relations. The outputs from these streams are fused in a fully connected layer to provide the final action prediction. Evaluations on the NTU RGB+D 60 and NTU RGB+D 120 datasets demonstrate the superior performance of our model, particularly in group activity recognition.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the following problem: in skeleton-based action recognition, how can a model capture spatial movement and temporal dynamics simultaneously, so that it understands complex human activities more comprehensively? Traditional methods usually focus on either spatial or temporal features alone, which limits that understanding.

### Specific description of the problem:

1. **Limitations of single features**:
   - Traditional methods focus either on spatial features (such as body postures and joint positions) or on temporal features (such as the temporal evolution of actions). Processing only one kind of feature cannot fully capture complex activities that involve both movement and temporal patterns.
2. **Lack of interaction information**:
   - Existing multi-person action recognition methods usually treat individuals as independent entities and ignore the interaction information between them, especially in two-person interaction scenarios.
3. **Multi-scale spatio-temporal dependence**:
   - Complex action sequences contain long-range cross-joint relationships. Existing methods struggle to model these relationships effectively, especially when discriminative frames and joints must be selected.

### Solution:

To solve the above problems, the paper proposes the ARN-LSTM architecture, a multi-stream fusion model that aims to improve action recognition by integrating joint, movement, and temporal information. Specifically:

- **Multi-stream architecture**: The model includes three main parts:
  - **Joint Stream**: Extracts skeleton features and captures spatial configurations.
  - **Temporal Stream**: Captures dynamic temporal features and models the temporal evolution of actions.
  - **ARN-LSTM block**: Uses Time-Distributed Long Short-Term Memory (TD-LSTM) layers and an Attention Relation Network (ARN) to model temporal relationships and strengthen the correlation between joint and temporal features.
- **Fusion mechanism**: Fuses the outputs of the different streams through a fully connected layer to generate the final action prediction.

### Main contributions:

1. Proposes a multi-stream architecture that fuses joint, movement, and temporal information to improve action recognition.
2. Introduces the Attention Relation Network (ARN), which captures and amplifies the mutual relationships between spatial and temporal features.
3. Reports state-of-the-art performance on the NTU RGB+D 60 and NTU RGB+D 120 datasets, performing especially well on group activity recognition.

Through these innovations, the ARN-LSTM model can understand and process complex human activity sequences more comprehensively, especially those involving multi-person interactions.
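
To make the fusion mechanism described under Solution more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The LSTM layers and the ARN are replaced by simple stand-ins (mean pooling and a dot-product attention pool), the temporal stream is assumed to consume frame-to-frame differences, and every function name, shape, and weight below is hypothetical. It only illustrates the overall data flow: three per-stream feature vectors are concatenated and passed through one fully connected layer followed by a softmax.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def frame_differences(frames):
    """Assumed temporal-stream input: displacement between consecutive frames."""
    return [[cur - prev for prev, cur in zip(a, b)]
            for a, b in zip(frames, frames[1:])]


def mean_pool(frames):
    """Average each coordinate over time."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def attention_pool(frames, query):
    """Weight each frame by softmax(frame . query), then sum over time."""
    weights = softmax([sum(f * q for f, q in zip(frame, query))
                       for frame in frames])
    pooled = [sum(w * frame[i] for w, frame in zip(weights, frames))
              for i in range(len(frames[0]))]
    return pooled, weights


def fuse_and_classify(frames, query, fc_weights, fc_bias):
    """Late fusion: concatenate three stream features, then FC + softmax."""
    joint_feat = mean_pool(frames)                        # joint stream stand-in
    temporal_feat = mean_pool(frame_differences(frames))  # temporal stream stand-in
    relation_feat, _ = attention_pool(frames, query)      # ARN-LSTM block stand-in
    fused = joint_feat + temporal_feat + relation_feat    # list concatenation
    return softmax(linear(fused, fc_weights, fc_bias))


# Toy input: 8 frames, 2 joints x 3 coordinates = 6 values per frame, 4 classes.
random.seed(0)
T, D, NUM_CLASSES = 8, 6, 4
frames = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
query = [random.uniform(-1, 1) for _ in range(D)]
fc_weights = [[random.uniform(-0.1, 0.1) for _ in range(3 * D)]
              for _ in range(NUM_CLASSES)]
fc_bias = [0.0] * NUM_CLASSES

probs = fuse_and_classify(frames, query, fc_weights, fc_bias)
print(probs)
```

With random weights the printed class probabilities carry no meaning. In a trained model each stand-in would be a learned module and the fully connected layer would be fit jointly with the streams.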