Abstract: While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that the standard full-image CNN features are suboptimal for scene classification. In this article, we investigate a framework that allows greater spatial flexibility by instead considering the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities (RGB, HHA, and surface normals) extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity: only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity: features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. We further apply our feature fusion framework to an action recognition task to demonstrate that it generalizes to other well-structured multimodal features. In particular, for action recognition, we enforce interpart sparsity to choose more discriminative body parts, and intermodal nonsparsity so that informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also effective for action recognition, achieving performance competitive with the state of the art.
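The two regularization terms named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption: the weight layout `(modalities, components, dims)`, the function names, and the choice of the common textbook forms of the penalties (a sum of per-group L2 norms for group lasso, and a sum of squared within-group L1-style sums for exclusive group lasso). The paper's actual terms may differ.

```python
import numpy as np

# Hypothetical weight layout (an assumption, not taken from the paper):
# W[m, k, :] holds the regressor weights for modality m and GMM component k.

def group_lasso(W):
    """Sum over components of the L2 norm of each component's weights.

    Groups all modalities of one component together, so the penalty
    pushes entire components to zero (component sparsity).
    """
    M, K, D = W.shape
    per_component = W.transpose(1, 0, 2).reshape(K, M * D)
    return np.linalg.norm(per_component, axis=1).sum()

def exclusive_group_lasso(W):
    """Sum over modalities of the squared sum of per-block L2 norms.

    The inner sum acts like an L1 norm within a modality, while the
    squared outer sum acts like an L2 penalty across modalities, so weight
    spread over several modalities costs less than the same weight
    concentrated in one (modal nonsparsity).
    """
    block_norms = np.linalg.norm(W, axis=2)      # shape (M, K)
    return (block_norms.sum(axis=1) ** 2).sum()

# Two toy weight tensors with the same active components (0 and 1):
# one keeps both in modality 0, the other spreads them over two modalities.
concentrated = np.zeros((2, 3, 4))
concentrated[0, 0, :] = 1.0
concentrated[0, 1, :] = 1.0

spread = np.zeros((2, 3, 4))
spread[0, 0, :] = 1.0
spread[1, 1, :] = 1.0

gl_concentrated = group_lasso(concentrated)
gl_spread = group_lasso(spread)
egl_concentrated = exclusive_group_lasso(concentrated)
egl_spread = exclusive_group_lasso(spread)
```

In this toy case the group lasso penalty is identical for both tensors, since the same two components are active, while the exclusive group lasso penalty is lower for the tensor whose weights are spread across modalities. That is the qualitative behavior the two postulates call for, under the assumptions stated above.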

Multi-modal Feature Fusion for Geographic Image Annotation

Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

A multimodal fusion framework for urban scene understanding and functional identification using geospatial data

CMGFNet: A deep cross-modal gated fusion network for building extraction from very high-resolution remote sensing images

Deep Feature Selection-And-Fusion for RGB-D Semantic Segmentation

Multi-type and Multi-level Feature Fusion Network for RGBD Indoor Semantic Segmentation

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

Multimodal Fusion Strategies for Mapping Biophysical Landscape Features

A Multi-Branch Feature Fusion Network for Building Detection in Remote Sensing Images

Multi-View Feature Fusion and Rich Information Refinement Network for Semantic Segmentation of Remote Sensing Images

MMSS: Multi-modal Sharable and Specific Feature Learning for RGB-D Object Recognition

A Deep-Learning-Based Multimodal Data Fusion Framework for Urban Region Function Recognition

More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification

Multi-modal land cover mapping of remote sensing images using pyramid attention and gated fusion networks

Dual-Path Feature Fusion Network for Semantic Segmentation of Remote Sensing Images

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection

A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion

DMFF: dual-way multimodal feature fusion for 3D object detection