Abstract: Recognizing complex human activities usually requires detecting and modeling individual visual features and the interactions between them. Current methods rely only on visual features extracted from 2-D images, which often leads to unreliable detection of salient visual features and inaccurate modeling of the interaction context between individual features. In this paper, we show that these problems can be addressed by combining data from a conventional camera and a depth sensor (e.g., Microsoft Kinect). We propose a novel complex activity recognition and localization framework that fuses information from both grayscale and depth image channels at multiple levels of the video processing pipeline. At the individual visual feature detection level, depth-based filters are applied to the detected human/object rectangles to remove false detections. At the next level, interaction modeling, 3-D spatial and temporal contexts among human subjects or objects are extracted by integrating information from both grayscale and depth images. Depth information is also used to distinguish different types of indoor scenes. Finally, a latent structural model is developed to integrate the information from multiple levels of video processing for activity detection. Extensive experiments on two activity recognition benchmarks (one with depth information) and on a challenging grayscale + depth human activity database containing complex human-human, human-object, and human-surroundings interactions demonstrate the effectiveness of the proposed multilevel grayscale + depth fusion scheme, which achieves higher recognition and localization accuracies than previous methods.
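The abstract does not specify how the depth-based filters work, so the following is only an illustrative sketch of one way such a filter might reject false human detections. It assumes a pinhole camera model and a simple physical-height plausibility check; the function names, the height thresholds, and the focal length parameter are all hypothetical and are not taken from the paper.

```python
from statistics import median


def physical_height_m(box, depth_m, focal_px):
    """Estimate the real-world height (metres) of a detection rectangle.

    box is (x, y, w, h) in pixels. depth_m is a row-major 2-D list of
    depths in metres, where 0 marks a missing reading (an assumption
    about the sensor output, not something stated in the paper).
    """
    x, y, w, h = box
    # Collect the valid depth readings that fall inside the rectangle.
    values = [d for row in depth_m[y:y + h] for d in row[x:x + w] if d > 0]
    if not values:
        return None  # no usable depth inside the box
    # The median tolerates a minority of background pixels in the box.
    z = median(values)
    # Pinhole model: size in metres = size in pixels * depth / focal length.
    return h * z / focal_px


def filter_detections(boxes, depth_m, focal_px, min_h=1.0, max_h=2.2):
    """Keep only boxes whose estimated height is plausible for a person.

    min_h and max_h are hypothetical thresholds chosen for illustration.
    """
    kept = []
    for box in boxes:
        height = physical_height_m(box, depth_m, focal_px)
        if height is not None and min_h <= height <= max_h:
            kept.append(box)
    return kept
```

A 2-D detector cannot tell a person from a person-shaped poster or a small nearby object of similar appearance. Depth fixes the scale of each rectangle, which is the kind of cue a filter like this would exploit.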

Multilevel Depth and Image Fusion for Human Activity Detection.
