Multi-View Region Adaptive Multi-temporal DMM and RGB Action Recognition

Mahmoud Al-Faris, John P. Chiverton, Yanyan Yang, David L. Ndzi
DOI: https://doi.org/10.48550/arXiv.1904.06074
2019-04-12
Abstract: Human action recognition remains an important yet challenging task. This work proposes a novel action recognition system. It uses a novel Multiple View Region Adaptive Multi-resolution in time Depth Motion Map (MV-RAMDMM) formulation combined with appearance information. Multiple stream 3D Convolutional Neural Networks (CNNs) are trained on the different views and time resolutions of the region adaptive Depth Motion Maps. Multiple views are synthesised to enhance the view invariance. The region adaptive weights, based on localised motion, accentuate and differentiate parts of actions possessing faster motion. Dedicated 3D CNN streams for multi-time resolution appearance information (RGB) are also included. These help to identify and differentiate between small object interactions. A pre-trained 3D-CNN is used here with fine-tuning for each stream along with multiple class Support Vector Machines (SVMs). Average score fusion is used on the output. The developed approach is capable of recognising both human action and human-object interaction. Three public domain datasets (MSR 3D Action, Northwestern UCLA multi-view actions and MSR 3D daily activity) are used to evaluate the proposed solution. The experimental results demonstrate the robustness of this approach compared with state-of-the-art algorithms.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address four key challenges in human action recognition:

1. **Multi-view Adaptability**: Recognition should remain robust and accurate across viewpoints. Traditional methods are usually sensitive to viewpoint changes, so their performance drops when the viewpoint changes.
2. **Multi-temporal Resolution**: Actions unfold at different time scales. Some are completed in a short time, while others take longer, so a method with a single temporal resolution struggles to fit all types of actions.
3. **Local Motion Weighting**: For a given action, motion in certain regions matters more than in others. Traditional methods often fail to single out these regions, which lowers recognition accuracy.
4. **Utilization of Appearance Information**: In addition to depth, appearance information (such as RGB images) is important for recognizing certain types of actions, especially those involving object interactions. Traditional methods often rely solely on depth and ignore appearance.

To address these issues, the paper proposes an action recognition system that combines multi-view, multi-temporal-resolution depth motion maps (DMMs) with appearance information (RGB), using multi-stream 3D convolutional neural networks (3D CNNs). Specifically, the system includes the following components:

- **Multi-view Region Adaptive Multi-temporal Resolution Depth Motion Map (MV-RAMDMM)**: Builds depth motion maps from multiple synthesised viewpoints and at several temporal resolutions to improve adaptability to viewpoint and time scale. Region adaptive weights, based on localised motion, highlight the parts of an action with faster motion.
- **Multi-stream 3D CNN**: Fine-tunes a pre-trained 3D CNN for each stream, one per viewpoint and temporal resolution, and pairs the streams with multi-class SVMs.
- **Multi-temporal Resolution RGB Information**: Uses RGB frames at different temporal resolutions to capture the appearance of actions, especially those involving small object interactions.
- **Average Score Fusion**: Averages the output scores from the different viewpoints, temporal resolutions, and modalities to produce the final prediction.

With these techniques, the paper reports better recognition performance than state-of-the-art methods on three public datasets (MSR 3D Action, Northwestern UCLA multi-view actions, and MSR 3D daily activity), especially in cross-view and cross-subject recognition tasks.
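
The average score fusion step can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `average_score_fusion`, the stream names, and the score values are all hypothetical, and the sketch assumes each stream already outputs per-class scores on a comparable scale (for example, normalized SVM scores).

```python
import numpy as np


def average_score_fusion(stream_scores):
    """Average per-class scores across streams and return the winning class.

    stream_scores: sequence of 1-D score vectors, one per stream,
    all of the same length (number of classes). Hypothetical helper.
    """
    # Stack into shape (n_streams, n_classes), then average over streams.
    scores = np.stack([np.asarray(s, dtype=float) for s in stream_scores])
    fused = scores.mean(axis=0)
    return int(np.argmax(fused)), fused


# Made-up scores for three classes from three hypothetical streams.
dmm_front_view = [0.6, 0.3, 0.1]
dmm_side_view = [0.2, 0.5, 0.3]
rgb_stream = [0.5, 0.4, 0.1]

label, fused = average_score_fusion([dmm_front_view, dmm_side_view, rgb_stream])
print(label, fused)
```

In this toy case the side-view stream alone would pick class 1, but the averaged scores favour class 0, which shows how fusion lets the other streams outvote a single disagreeing one.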