Abstract: Fusion of multiple modalities from different sensors is an important area of research for multimodal human action recognition. In this paper, we conduct an in-depth study of the effect of different parameters, such as input preprocessing, data augmentation, network architectures and model fusion, in order to arrive at a practical guideline for multimodal action recognition using the deep learning paradigm. First, for RGB videos, we propose a novel image-based descriptor called stacked dense flow difference image (SDFDI), capable of capturing the spatio-temporal information present in a video sequence. A variety of deep 2D convolutional neural networks (CNNs) are then trained to compare our SDFDI against state-of-the-art image-based representations. Second, for the skeleton stream, we propose a data augmentation technique based on 3D transformations to facilitate training a deep neural network on small datasets. We also propose a bidirectional gated recurrent unit (BiGRU) based recurrent neural network (RNN) to model skeleton data. Third, for inertial sensor data, we propose data augmentation based on jittering with white Gaussian noise, along with a deep 1D-CNN for action classification. The outputs of these three heterogeneous networks (1D-CNN, 2D-CNN and BiGRU) are combined by a variety of model fusion approaches based on score and feature fusion. Finally, to illustrate the efficacy of the proposed framework, we test our model on the publicly available UTD-MHAD dataset and achieve an overall accuracy of 97.91%, which is about 4% higher than using each modality individually. We hope that the discussions and conclusions from this work will give researchers in related fields deeper insight and open avenues for further studies of multi-sensor fusion architectures.
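To make three of the ingredients named in the abstract more concrete, the following is a minimal sketch of jittering with white Gaussian noise, a 3D transformation of skeleton joints, and score-level fusion. The abstract does not give the noise level, the specific 3D transformations, or the fusion rule, so the function names (`jitter`, `rotate_skeleton_z`, `fuse_scores`), the default `sigma`, the choice of a rotation about the z-axis, and the weighted-average rule are all illustrative assumptions and not the authors' implementation.

```python
import math
import random


def jitter(signal, sigma=0.05, rng=None):
    """Add zero-mean white Gaussian noise to every sample of a 1D signal.

    `sigma` is an assumed noise level; the abstract does not specify one.
    """
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, sigma) for x in signal]


def rotate_skeleton_z(joints, angle_rad):
    """Rotate a list of (x, y, z) joints about the z-axis.

    A single-axis rotation is only one example of a 3D transformation.
    """
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in joints]


def fuse_scores(score_lists, weights=None):
    """Score-level fusion by (weighted) averaging of class-probability vectors.

    Returns the fused score vector and the index of the winning class.
    Averaging is an assumed rule; other rules (product, max) are possible.
    """
    n_models = len(score_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(score_lists[0])
    fused = [
        sum(w * scores[k] for w, scores in zip(weights, score_lists))
        for k in range(n_classes)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return fused, label


if __name__ == "__main__":
    # Hypothetical softmax outputs of the three per-modality networks.
    rgb_scores = [0.7, 0.2, 0.1]
    skeleton_scores = [0.1, 0.6, 0.3]
    inertial_scores = [0.2, 0.5, 0.3]
    fused, label = fuse_scores([rgb_scores, skeleton_scores, inertial_scores])
    print("fused scores:", fused, "predicted class:", label)

    # Augmented copies of a toy inertial signal and a toy two-joint skeleton.
    print(jitter([0.0, 1.0, 2.0], sigma=0.05, rng=random.Random(0)))
    print(rotate_skeleton_z([(1.0, 0.0, 0.0), (0.0, 1.0, 0.5)], math.pi / 6))
```

In this sketch each augmentation returns a new sequence and leaves its input unchanged, so augmented copies can be appended to a small training set alongside the originals. Feature-level fusion, the other family mentioned in the abstract, would instead concatenate intermediate features of the networks before a shared classifier and is not shown here.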
