Abstract:While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that the standard full-image CNNfeatures are suboptimal for scene classification. In this article, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities of RGB, HHA, and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity-that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity-that features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we are able to achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. Moreover, we further apply our feature fusion framework on an action recognition task to demonstrate that our framework can be generalized for other multimodal well-structured features. In particular, for action recognition, we enforce interpart sparsity to choose more discriminative body parts, and intermodal nonsparsity to make informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also very effective for action recognition, achieving very competitive performance compared with the state of the art.

Heterogeneous Visual Features Fusion via Sparse Multimodal Machine

CMCI: A Robust Multimodal Fusion Method for Spiking Neural Networks

Multimodal visual dictionary learning via heterogeneous latent semantic sparse coding

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

Mutual information maximization and feature space separation and bi-bimodal mo-dality fusion for multimodal sentiment analysis

Bridging the Gap between Multi-focus and Multi-modal: A Focused Integration Framework for Multi-modal Image Fusion

Rethinking the constraints of multimodal fusion: case study in Weakly-Supervised Audio-Visual Video Parsing

Multimodal image fusion with adaptive joint sparsity model

Heterogeneous feature learning network for multimodal remote sensing image collaborative classification

Integration of Multimodal Features for Video Scene Classification Based on HMM

Weakly paired multimodal fusion using multilayer extreme learning machine

Cross-Modal Hybrid Feature Fusion for Image-Sentence Matching

Learn to Combine Modalities in Multimodal Deep Learning

Optimal Multimodal Fusion for Multimedia Data Analysis

Improving Fine-grained Image Classification with Multimodal Information

Interpretation on Multi-modal Visual Fusion

A unified multimodal classification framework based on deep metric learning

Multimodal Feature Fusion for 3D Shape Recognition and Retrieval.

Countering Modal Redundancy and Heterogeneity: A Self-Correcting Multimodal Fusion

Multimodal Hyperspectral Image Classification via Interconnected Fusion

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification