Abstract: In this paper, we present Action-Fusion, a method for human action recognition from depth maps and posture data using convolutional neural networks (CNNs). Two input descriptors are used for action representation. The first is a depth motion image (DMI) that accumulates the consecutive depth maps of a human action, whilst the second is a proposed moving joints descriptor that represents the motion of body joints over time. To maximize feature extraction for accurate action classification, three CNN channels are trained with different inputs: the first channel with DMIs, the second with both DMIs and moving joints descriptors together, and the third with moving joints descriptors only. The action predictions generated by the three CNN channels are fused for the final action classification. We propose several score fusion operations to maximize the score of the correct action. The experiments show that fusing the outputs of all three channels gives better results than using a single channel or fusing only two channels. The proposed method was evaluated on three public datasets: 1) the Microsoft Action 3-D dataset (MSRAction3D); 2) the University of Texas at Dallas Multimodal Human Action Dataset; and 3) the Multimodal Action Dataset (MAD). The test results indicate that the proposed approach outperforms most existing state-of-the-art methods, such as histogram of oriented 4-D normals and Actionlet, on MSRAction3D. Although MAD contains a larger number of actions (35) than other existing RGB-D action datasets, the proposed method surpasses a state-of-the-art method on this dataset by 6.84%.
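The late-fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the fusion score operations, so the three used here (average, product, maximum) are assumptions chosen as common examples, and the function names `fuse_scores` and `predict` as well as the score values are hypothetical. Each channel is assumed to output one probability per action class.

```python
from math import prod


def fuse_scores(channel_scores, op="product"):
    """Fuse per-class scores from several classifiers into one score vector.

    channel_scores: list of equal-length lists, one per CNN channel,
                    each holding class probabilities (e.g. softmax outputs).
    op: "average", "product", or "max" (illustrative choices, not
        necessarily the operations used in the paper).
    """
    if not channel_scores:
        raise ValueError("need at least one channel")
    n_classes = len(channel_scores[0])
    if any(len(s) != n_classes for s in channel_scores):
        raise ValueError("all channels must score the same classes")

    per_class = list(zip(*channel_scores))  # regroup the scores by class
    if op == "average":
        return [sum(c) / len(c) for c in per_class]
    if op == "product":
        return [prod(c) for c in per_class]
    if op == "max":
        return [max(c) for c in per_class]
    raise ValueError(f"unknown fusion op: {op}")


def predict(channel_scores, op="product"):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(channel_scores, op)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for 3 action classes from the three channels.
dmi_only = [0.50, 0.40, 0.10]        # channel 1: DMIs
dmi_and_joints = [0.30, 0.60, 0.10]  # channel 2: DMIs + moving joints
joints_only = [0.35, 0.55, 0.10]     # channel 3: moving joints
label = predict([dmi_only, dmi_and_joints, joints_only], op="product")
```

In this made-up example the first channel alone would pick class 0, while the fused score picks class 1 because the other two channels agree on it. This is the kind of correction that fusing several channels is meant to provide.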

Fusion of Skeletal and STIP-Based Features for Action Recognition with RGB-D Devices

3D Human Activity Recognition Using Skeletal Data from RGBD Sensors

3D Human Action Recognition Based on the Spatial-Temporal Moving Skeleton Descriptor

Fusion of Skeleton and RGB Features for RGB-D Human Action Recognition

Action Recognition Based on Fusion Skeleton of Two Kinect Sensors

Distributed RGBD Camera Network for 3D Human Pose Estimation and Action Recognition

Action Recognition with Joints-Pooled 3D Deep Convolutional Descriptors

Joints-Centered Spatial-Temporal Features Fused Skeleton Convolution Network for Action Recognition

Combining Adaptive Hierarchical Depth Motion Maps with Skeletal Joints for Human Action Recognition

Local feature coding for action recognition using RGB-D camera

Action Recognition Based on 3D Skeleton and RGB Frame Fusion

Spatio-temporal cuboid pyramid for action recognition using depth motion sequences

Fusing Shape and Motion Matrices for View Invariant Action Recognition Using 3D Skeletons

Infrared and 3D skeleton feature fusion for RGB-D action recognition

Activity Recognition from RGB-D Camera with 3D Local Spatio-temporal Features

Using a Selective Ensemble Support Vector Machine to Fuse Multimodal Features for Human Action Recognition

An effective representation for action recognition with human skeleton joints

Effective Human Action Recognition Using Global and Local Offsets of Skeleton Joints

Human-centric multimodal fusion network for robust action recognition

Deep Convolutional Neural Networks for Human Action Recognition Using Depth Maps and Postures

Feature Fusion of Triaxial Acceleration Signals and Depth Maps for Human Action Recognition