Abstract: While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that standard full-image CNN features are suboptimal for scene classification. In this article, we investigate a framework that allows greater spatial flexibility: instead of a single full-image feature, we consider the Fisher vector (FV)-encoded distribution of local CNN features obtained from a multitude of region proposals per image. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities (RGB, HHA, and surface normals) extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity, that only a small variety of region proposals and their corresponding FV Gaussian mixture model (GMM) components contribute to scene discriminability, and (2) modal nonsparsity, that features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these postulates are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. We further apply our feature fusion framework to an action recognition task to demonstrate that it generalizes to other multimodal, well-structured features. In particular, for action recognition, we enforce interpart sparsity to choose the more discriminative body parts, and intermodal nonsparsity so that informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also effective for action recognition, achieving performance competitive with the state of the art.
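The two regularizers named in the abstract can be illustrated with their standard textbook forms. This is a minimal sketch, not the authors' implementation: the weight layout `W[m, k, :]` (modality `m`, GMM component `k`), the function names, and the exact choice of groups are assumptions made for illustration only.

```python
import numpy as np

def group_lasso(W):
    """Sum of L2 norms, one group per GMM component (assumed grouping).

    W has the hypothetical shape (modalities, components, dims). Each group
    spans all modalities of one component, so whole components are pushed
    to zero together (component sparsity).
    """
    n_modalities, n_components, n_dims = W.shape
    per_component = W.transpose(1, 0, 2).reshape(n_components, -1)
    return np.linalg.norm(per_component, axis=1).sum()

def exclusive_group_lasso(W):
    """Sum over modalities of the squared sum of block L2 norms (assumed grouping).

    Blocks inside one modality compete with each other, while concentrating
    all weight in a single modality is penalized more heavily than spreading
    it across modalities (modal nonsparsity).
    """
    block_norms = np.linalg.norm(W, axis=2)        # shape (modalities, components)
    return (block_norms.sum(axis=1) ** 2).sum()

# Toy weights: 2 modalities, 3 components, 2 dimensions per block.
W_spread = np.zeros((2, 3, 2))
W_spread[0, 0] = [3.0, 4.0]   # modality 0, component 0 (block norm 5)
W_spread[1, 1] = [0.0, 2.0]   # modality 1, component 1 (block norm 2)

W_single = np.zeros((2, 3, 2))
W_single[0, 0] = [3.0, 4.0]   # same two blocks, both in modality 0
W_single[0, 1] = [0.0, 2.0]

print(group_lasso(W_spread), group_lasso(W_single))
print(exclusive_group_lasso(W_spread), exclusive_group_lasso(W_single))
```

In this toy case the group lasso term is identical for both weight layouts, while the exclusive term is smaller when the two active blocks sit in different modalities, which is the behavior the modal nonsparsity postulate relies on.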

Multi-modality Fusion Network for Action Recognition

DFN: A deep fusion network for flexible single and multi-modal action recognition

Fusion-GCN: Multimodal Action Recognition using Graph Convolutional Networks

Human-centric multimodal fusion network for robust action recognition

MSF-Net: A Multilevel Spatiotemporal Feature Fusion Network Combines Attention for Action Recognition

Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition

Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition

Multidomain Multimodal Fusion For Human Action Recognition Using Inertial Sensors

M^3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition

Human Action Recognition Based on Improved Fusion Attention CNN and RNN

Human Action Recognition Using Deep Multilevel Multimodal (M2) Fusion of Depth and Inertial Sensors

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

DC3D: A Video Action Recognition Network Based on Dense Connection

Symmetrical Enhanced Fusion Network for Skeleton-Based Action Recognition

A multidimensional feature fusion network based on MGSE and TAAC for video-based human action recognition

Multi-Modality Adaptive Feature Fusion Graph Convolutional Network for Skeleton-Based Action Recognition

Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition

Multi-scale residual network model combined with Global Average Pooling for action recognition

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

Towards Improved Human Action Recognition Using Convolutional Neural Networks and Multimodal Fusion of Depth and Inertial Sensor Data