Abstract: Fusion of multiple modalities from different sensors is an important area of research for multimodal human action recognition. In this paper, we conduct an in-depth study of the effect of different parameters, such as input preprocessing, data augmentation, network architectures and model fusion, in order to arrive at a practical guideline for multimodal action recognition using the deep learning paradigm. First, for RGB videos, we propose a novel image-based descriptor called the stacked dense flow difference image (SDFDI), which is capable of capturing the spatio-temporal information present in a video sequence. A variety of deep 2D convolutional neural networks (CNNs) are then trained to compare our SDFDI against state-of-the-art image-based representations. Second, for the skeleton stream, we propose a data augmentation technique based on 3D transformations to facilitate training a deep neural network on small datasets. We also propose a bidirectional gated recurrent unit (BiGRU) based recurrent neural network (RNN) to model skeleton data. Third, for inertial sensor data, we propose data augmentation based on jittering with white Gaussian noise, along with a deep 1D-CNN for action classification. The outputs of these three heterogeneous networks (1D-CNN, 2D-CNN and BiGRU) are combined using a variety of model fusion approaches based on score and feature fusion. Finally, to illustrate the efficacy of the proposed framework, we test our model on the publicly available UTD-MHAD dataset and achieve an overall accuracy of 97.91%, which is about 4% higher than using each modality individually. We hope that the discussions and conclusions from this work will provide deeper insight to researchers in related fields and open avenues for further studies of different multi-sensor fusion architectures.
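
As a rough illustration of three ideas named in the abstract (jittering inertial signals with white Gaussian noise, augmenting skeletons with a 3D transformation, and score-level fusion), a minimal NumPy sketch might look like the following. It is not the paper's implementation: the function names, the noise level `sigma`, the choice of a rotation about the y axis as the 3D transformation, and the weighted averaging rule for score fusion are all assumptions made for the example.

```python
import numpy as np


def jitter(signal, sigma=0.05, rng=None):
    """Add zero-mean white Gaussian noise to an inertial signal.

    `sigma` is an assumed noise level, not a value from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    return signal + rng.normal(0.0, sigma, size=signal.shape)


def rotate_skeleton(joints, angle_rad):
    """Rotate skeleton joints of shape (frames, joints, 3) about the y axis.

    A rotation is only one possible 3D transformation; it is an assumption here.
    """
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    return joints @ rot_y.T


def score_fusion(prob_list, weights=None):
    """Fuse per-model class probabilities by a weighted average (assumed rule)."""
    probs = np.stack(prob_list)  # shape: (models, classes)
    if weights is None:
        weights = np.full(len(prob_list), 1.0 / len(prob_list))
    weights = np.asarray(weights, dtype=float)
    fused = np.tensordot(weights, probs, axes=1)  # shape: (classes,)
    return fused, int(np.argmax(fused))
```

In this sketch each network (for example the 1D-CNN, 2D-CNN and BiGRU) would contribute one probability vector to `score_fusion`; feature-level fusion, which the abstract also mentions, would instead concatenate intermediate features before a shared classifier and is not shown here.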

Multidomain Multimodal Fusion For Human Action Recognition Using Inertial Sensors

Human Action Recognition Using Deep Multilevel Multimodal (M2) Fusion of Depth and Inertial Sensors

Towards Improved Human Action Recognition Using Convolutional Neural Networks and Multimodal Fusion of Depth and Inertial Sensor Data

Inertial Sensor Data To Image Encoding For Human Action Recognition

Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition.

Human-centric multimodal fusion network for robust action recognition

Multi-modality Fusion Network for Action Recognition.

Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition

Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition

DMMs-Based Multiple Features Fusion for Human Action Recognition

Fusion-GCN: Multimodal Action Recognition using Graph Convolutional Networks

Multimodal fusion for audio-image and video action recognition

Hand-crafted and deep convolutional neural network features fusion and selection strategy: An application to intelligent human action recognition

Multimodal human action recognition based on spatio-temporal action representation recognition model

Multi-view key information representation and multi-modal fusion for single-subject routine action recognition

From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

A Novel Two Stream Decision Level Fusion of Vision and Inertial Sensors Data for Automatic Multimodal Human Activity Recognition System

A Multimodal Information Fusion Model for Robot Action Recognition with Time Series

Sensor fusion based manipulative action recognition