Abstract: Human action recognition remains an important yet challenging task. This work proposes a novel action recognition system. It uses a novel Multiple View Region Adaptive Multi-resolution in time Depth Motion Map (MV-RAMDMM) formulation combined with appearance information. Multiple stream 3D Convolutional Neural Networks (CNNs) are trained on the different views and time resolutions of the region adaptive Depth Motion Maps. Multiple views are synthesised to enhance the view invariance. The region adaptive weights, based on localised motion, accentuate and differentiate parts of actions possessing faster motion. Dedicated 3D CNN streams for multi-time resolution appearance information (RGB) are also included. These help to identify and differentiate between small object interactions. A pre-trained 3D CNN is used here with fine-tuning for each stream, along with multiple class Support Vector Machines (SVMs). Average score fusion is used on the output. The developed approach is capable of recognising both human action and human-object interaction. Three public domain datasets, MSR 3D Action, Northwestern UCLA multi-view actions and MSR 3D daily activity, are used to evaluate the proposed solution. The experimental results demonstrate the robustness of this approach compared with state-of-the-art algorithms.
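The abstract does not spell out the MV-RAMDMM formula. As a rough illustration only, the sketch below assumes the common definition of a depth motion map (per-pixel accumulation of absolute differences between depth frames), treats "multi-resolution in time" as computing that map at several frame strides, and uses a simple block-mean scaling as a stand-in for the region adaptive weights. The function names, the stride values, and the block weighting rule are all assumptions made for this example, not the paper's formulation. Plain Python lists are used so the example has no dependencies; an array library would be the natural choice in practice.

```python
def depth_motion_map(frames, stride=1):
    """Sum |frame[t + stride] - frame[t]| per pixel over the sequence.

    `frames` is a list of 2D lists (one depth image per time step).
    """
    if len(frames) <= stride:
        raise ValueError("need more frames than the stride")
    height, width = len(frames[0]), len(frames[0][0])
    dmm = [[0.0] * width for _ in range(height)]
    # Pair each frame with the frame `stride` steps later.
    for earlier, later in zip(frames, frames[stride:]):
        for y in range(height):
            for x in range(width):
                dmm[y][x] += abs(later[y][x] - earlier[y][x])
    return dmm


def multi_temporal_dmms(frames, strides=(1, 2, 4)):
    """One motion map per temporal stride (hypothetical stride choices)."""
    return {s: depth_motion_map(frames, s) for s in strides}


def region_adaptive_weighting(dmm, block=2):
    """Scale each block by its mean motion relative to the most active block.

    This is an assumed weighting rule: blocks with more accumulated motion
    keep more of their magnitude, quieter blocks are attenuated.
    """
    height, width = len(dmm), len(dmm[0])
    block_means = {}
    for by in range(0, height, block):
        for bx in range(0, width, block):
            cells = [dmm[y][x]
                     for y in range(by, min(by + block, height))
                     for x in range(bx, min(bx + block, width))]
            block_means[(by, bx)] = sum(cells) / len(cells)
    peak = max(block_means.values())
    if peak == 0:
        # No motion anywhere: return an unweighted copy.
        return [row[:] for row in dmm]
    return [[dmm[y][x] * block_means[(y - y % block, x - x % block)] / peak
             for x in range(width)]
            for y in range(height)]
```

In a pipeline like the one the abstract describes, each stride's map (and each synthesised view) would feed its own network stream; the sketch covers only the map construction.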

What problem does this paper attempt to address?

The paper attempts to address several key challenges in human action recognition:

1. **Multi-view Adaptability**: Recognition needs to stay robust and accurate across viewpoints. Traditional action recognition methods are usually sensitive to viewpoint changes, so their performance drops when the viewpoint differs.
2. **Multi-temporal Resolution**: Actions unfold at different time scales. Some are completed in a short time, while others require a longer duration, so methods with a single temporal resolution struggle to adapt to all types of actions.
3. **Local Motion Weighting**: Within an action, the motion in certain regions is more important than in others. Traditional methods often fail to distinguish these important motion regions, which reduces recognition accuracy.
4. **Utilization of Appearance Information**: In addition to depth information, appearance information (such as RGB images) is very important for recognizing certain types of actions, especially those involving object interactions. Traditional methods often rely solely on depth information and ignore the value of appearance information.

To address these issues, the paper proposes a new action recognition system that combines multi-view, multi-temporal resolution depth motion maps (DMM) with appearance information (RGB) and trains multi-stream 3D convolutional neural networks (3D CNN) on them. Specifically, the system includes the following components:

- **Multi-view Region Adaptive Multi-temporal Resolution Depth Motion Map (MV-RAMDMM)**: Enhances adaptability to different viewpoints and time scales through depth motion maps of multiple viewpoints and different temporal resolutions, and highlights important motion regions through region adaptive weighting.
- **Multi-stream 3D CNN**: Trains 3D CNNs on different viewpoints and temporal resolutions to extract multi-modal features.
- **Multi-temporal Resolution RGB Information**: Utilizes RGB images of different temporal resolutions to capture the appearance information of actions, especially for actions involving object interactions.
- **Average Score Fusion**: Fuses the output results of different viewpoints, temporal resolutions, and modalities to improve overall recognition performance.

Through these techniques, the system can achieve better recognition performance than existing methods on various public datasets, especially in cross-view and cross-subject recognition tasks.
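The average score fusion step can be illustrated with a minimal sketch. It assumes each stream (a view, temporal resolution, or modality) outputs one score per action class, and that fusion is an unweighted mean followed by an argmax. The function name and the example scores are hypothetical; the summary does not specify how scores are normalised before averaging.

```python
def average_score_fusion(stream_scores):
    """Average per-class scores across streams.

    `stream_scores` is a list of equal-length score lists, one per stream.
    Returns (predicted_class_index, fused_scores).
    """
    if not stream_scores:
        raise ValueError("need at least one stream")
    n_classes = len(stream_scores[0])
    if any(len(scores) != n_classes for scores in stream_scores):
        raise ValueError("all streams must score the same classes")
    n_streams = len(stream_scores)
    fused = [sum(scores[c] for scores in stream_scores) / n_streams
             for c in range(n_classes)]
    # The class with the highest mean score is the fused prediction.
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused


# Hypothetical scores from three streams over three action classes.
streams = [
    [0.2, 0.5, 0.3],  # e.g. a depth stream
    [0.1, 0.3, 0.6],  # e.g. a second depth view
    [0.3, 0.3, 0.4],  # e.g. an RGB stream
]
label, fused = average_score_fusion(streams)
```

Here the first stream alone would pick class 1, but the averaged scores favour class 2, which shows how fusion lets the other streams outvote a single stream.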

Multi-View Region Adaptive Multi-temporal DMM and RGB Action Recognition

CNN-BASED ACTION RECOGNITION USING ADAPTIVE MULTISCALE DEPTH MOTION MAPS AND STABLE JOINT DISTANCE MAPS

View-invariant Human Action Recognition Via Robust Locally Adaptive Multi-View Learning

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Spatio‐temporal attention modules in orientation‐magnitude‐response guided multi‐stream CNNs for human action recognition

Human Action Recognition Using Deep Learning Methods.

Deep Convolutional Neural Networks for Action Recognition Using Depth Map Sequences

Deep learning-based multi-view 3D-human action recognition using skeleton and depth data

Action Recognition from Depth Maps Using Deep Convolutional Neural Networks

Combining Multi-scale Directed Depth Motion Maps and Log-Gabor Filters for Human Action Recognition

Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition

Multiple Stream Deep Learning Model for Human Action Recognition

Temporal Cues Enhanced Multimodal Learning for Action Recognition in RGB-D Videos

3D Action Recognition Using Multi-Temporal Depth Motion Maps and Fisher Vector

Multi-view key information representation and multi-modal fusion for single-subject routine action recognition

3D Action Recognition Using Multi-Temporal Skeleton Visualization.

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

Multi-Temporal Depth Motion Maps-Based Local Binary Patterns for 3-D Human Action Recognition

Action Recognition for Depth Video using Multi-view Dynamic Images

Convnets-Based Action Recognition From Depth Maps Through Virtual Cameras And Pseudocoloring

Multi-view Multi-modal Approach Based on 5S-CNN and BiLSTM Using Skeleton, Depth and RGB Data for Human Activity Recognition