Abstract: While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that standard full-image CNN features are suboptimal for scene classification. In this article, we investigate a framework that allows greater spatial flexibility: instead of full-image features, it considers the Fisher vector (FV)-encoded distribution of local CNN features obtained from a multitude of region proposals per image. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities (RGB, HHA, and surface normals) extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity: only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity: features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these postulates are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. We further apply our feature fusion framework to an action recognition task to demonstrate that it generalizes to other multimodal, well-structured features. In particular, for action recognition, we enforce interpart sparsity to choose the more discriminative body parts, and intermodal nonsparsity so that informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also very effective for action recognition, achieving very competitive performance compared with the state of the art.
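The two regularizers named in the abstract can be illustrated with their textbook forms. The sketch below is not the paper's exact objective, which the abstract does not state. It assumes a hypothetical weight tensor `W` of shape (GMM components, modalities, feature dims), a group lasso penalty defined as the sum of per-group L2 norms, and an exclusive group lasso penalty defined as the sum of squared per-group L1 norms. The toy values and the names `by_component` and `by_modality` are made up for illustration.

```python
import numpy as np

def group_lasso(groups):
    """Sum of L2 norms, one per row (group); tends to zero out whole groups."""
    return float(np.sum(np.linalg.norm(groups, axis=1)))

def exclusive_group_lasso(groups):
    """Sum of squared L1 norms, one per row (group); penalizes concentrating
    weight in a single group, so weight tends to spread across groups."""
    return float(np.sum(np.sum(np.abs(groups), axis=1) ** 2))

def by_component(W):
    # One row per GMM component (assumed grouping for component sparsity).
    return W.reshape(W.shape[0], -1)

def by_modality(W):
    # One row per modality (assumed grouping for modal nonsparsity).
    return W.transpose(1, 0, 2).reshape(W.shape[1], -1)

# Hypothetical toy weights: 2 components x 2 modalities x 1 feature dim.
# W_a: a single component is active and it uses both modalities.
W_a = np.zeros((2, 2, 1))
W_a[0, 0, 0], W_a[0, 1, 0] = 3.0, 4.0

# W_b: both components are active but only the first modality is used.
W_b = np.zeros((2, 2, 1))
W_b[0, 0, 0], W_b[1, 0, 0] = 3.0, 4.0

gl_a = group_lasso(by_component(W_a))            # 5.0
gl_b = group_lasso(by_component(W_b))            # 7.0
egl_a = exclusive_group_lasso(by_modality(W_a))  # 25.0
egl_b = exclusive_group_lasso(by_modality(W_b))  # 49.0

print(gl_a, gl_b, egl_a, egl_b)
```

Under these assumptions, `W_a` (few components, all modalities present) incurs the lower penalty on both terms, which matches the direction of the two postulates: sparsity over components and coexistence of modalities.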

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond
