Abstract: Human activity understanding with 3D/depth sensors has received increasing attention in multimedia processing and interaction. This work aims to develop a novel deep model for automatic activity recognition from RGB-D videos. We represent each human activity as an ensemble of cubic-like video segments, and learn to discover the temporal structure for a category of activities, i.e. how the activities should be decomposed for classification. Our model can be regarded as a structured deep architecture, as it extends convolutional neural networks (CNNs) by incorporating structural alternatives. Specifically, we build a network consisting of 3D convolutions and max-pooling operators over the video segments, and introduce latent variables in each convolutional layer that manipulate the activation of neurons. Our model thus advances existing approaches in two aspects: (i) it acts directly on the raw inputs (grayscale-depth data) to conduct recognition instead of relying on hand-crafted features, and (ii) the model structure can be dynamically adjusted to account for the temporal variations of human activities, i.e. the network configuration is allowed to be partially activated during inference. For model training, we propose an EM-type optimization method that iteratively (i) discovers the latent structure by determining the decomposed actions for each training example, and (ii) learns the network parameters using the back-propagation algorithm. Our approach is validated in challenging scenarios and outperforms state-of-the-art methods. In addition, we present a large human activity database of RGB-D videos.
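
To make the building blocks named in the abstract concrete, the following is a minimal, dependency-free sketch of a 3D convolution over one video segment, followed by a per-frame on/off mask and max-pooling. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy sizes, the binary `active` mask (standing in for the latent variables that partially activate the network), and the choice of spatial-only pooling are all hypothetical.

```python
def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation of a T x H x W volume (nested lists)."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def apply_temporal_mask(features, active):
    """Zero out temporal slices whose (hypothetical) latent indicator is 0."""
    return [[[v * a for v in row] for row in plane]
            for plane, a in zip(features, active)]


def max_pool_spatial(features, size):
    """Non-overlapping size x size max-pooling applied to each temporal slice."""
    pooled = []
    for plane in features:
        H, W = len(plane), len(plane[0])
        pooled.append([
            [max(plane[y + dy][x + dx] for dy in range(size) for dx in range(size))
             for x in range(0, W - size + 1, size)]
            for y in range(0, H - size + 1, size)
        ])
    return pooled


# Toy segment: 4 frames of 6 x 6 pixels, all ones (assumed sizes for illustration).
segment = [[[1.0] * 6 for _ in range(6)] for _ in range(4)]
# Toy kernel spanning 2 frames and a 3 x 3 spatial window.
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]

features = conv3d_valid(segment, kernel)   # 3 x 4 x 4 feature volume
active = [1, 0, 1]                         # assumed latent on/off state per slice
pooled = max_pool_spatial(apply_temporal_mask(features, active), 2)  # 3 x 2 x 2
```

In this sketch the mask is fixed by hand. In an EM-type scheme such as the one the abstract describes, the latent configuration would instead be re-estimated for each training example in alternation with back-propagation updates of the kernel weights.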

DMM-Pyramid Based Deep Architectures for Action Recognition with Depth Cameras

Deep Convolutional Neural Networks for Action Recognition Using Depth Map Sequences

3D Convolutional Neural Network for Action Recognition

Deep learning-based multi-view 3D-human action recognition using skeleton and depth data

Action Recognition for Depth Video using Multi-view Dynamic Images

3DFCNN: Real-Time Action Recognition using 3D Deep Neural Networks with Raw Depth Information

Human Action Recognition Using Deep Learning Methods

Human Action Recognition From Digital Videos Based on Deep Learning

CNN-based and DTW features for human activity recognition on depth maps

Learning to Recognize 3D Human Action from A New Skeleton-based Representation Using Deep Convolutional Neural Networks

Skeleton-Based Human Action Recognition Using Spatial Temporal 3D Convolutional Neural Networks

RGB-D Based Action Recognition with Light-weight 3D Convolutional Networks

Action Recognition from Depth Sequences Using Weighted Fusion of 2D and 3D Auto-Correlation of Gradients Features

D3D: Dual 3-D Convolutional Network for Real-Time Action Recognition

Skeleton-based Action Recognition Using LSTM and CNN

Continuous Motion Recognition In Depth Camera Based On Recurrent Neural Networks And Grid-Based Average Depth

Two-Stream 3D Convolutional Neural Network for Skeleton-Based Action Recognition

End-to-end Learning of Deep Convolutional Neural Network for 3D Human Action Recognition

Online Robust Action Recognition Based on a Hierarchical Model

A Fine-to-Coarse Convolutional Neural Network for 3D Human Action Recognition

3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks