Abstract: While convolutional neural networks (CNNs) excel at object recognition, the greater spatial variability in scene images typically means that standard full-image CNN features are suboptimal for scene classification. In this article, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities (RGB, HHA, and surface normals) extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity: only a small variety of region proposals and their corresponding FV Gaussian mixture model (GMM) components contribute to scene discriminability, and (2) modal nonsparsity: features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. We further apply our feature fusion framework to an action recognition task to demonstrate that it generalizes to other multimodal, well-structured features. In particular, for action recognition, we enforce interpart sparsity to choose more discriminative body parts, and intermodal nonsparsity so that informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also effective for action recognition, achieving performance competitive with the state of the art.
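The two regularization terms named in the abstract can be illustrated with a minimal sketch. This is not the paper's exact objective: the weight layout `W[modality, component, dim]`, the function names, and the specific penalty forms (a standard group lasso with one group per GMM component, and a standard exclusive lasso with one group per modality) are assumptions chosen for illustration.

```python
import numpy as np


def group_lasso_components(W):
    """Group lasso with one group per GMM component (assumed layout).

    W has shape (modalities, components, dims). Each group pools a
    component's weights across all modalities and dims, so the penalty
    pushes whole components to zero (component sparsity).
    """
    per_component_l2 = np.sqrt((W ** 2).sum(axis=(0, 2)))
    return per_component_l2.sum()


def exclusive_lasso_modalities(W):
    """Exclusive lasso with one group per modality (assumed form).

    Sums the squared L1 norm of each modality's weights. For a fixed
    total weight mass, the penalty is smaller when the mass is spread
    over several modalities than when one modality carries all of it,
    which discourages dropping a modality entirely (modal nonsparsity).
    """
    per_modality_l1 = np.abs(W).sum(axis=(1, 2))
    return (per_modality_l1 ** 2).sum()


# Hypothetical toy weights: 2 modalities, 3 components, 2 dims each.
W_one_modality = np.zeros((2, 3, 2))
W_one_modality[0, 0] = [3.0, 4.0]      # all mass in modality 0

W_both_modalities = np.zeros((2, 3, 2))
W_both_modalities[0, 0] = [1.5, 2.0]   # same total L1 mass,
W_both_modalities[1, 0] = [1.5, 2.0]   # split across both modalities

print(group_lasso_components(W_one_modality))
print(exclusive_lasso_modalities(W_one_modality))
print(exclusive_lasso_modalities(W_both_modalities))
```

In this toy case both weight tensors use only component 0, so the group lasso term treats them as equally sparse in components, while the exclusive term is lower for the tensor that keeps both modalities active.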

Local Visual Feature Fusion Via Maximum Margin Multimodal Deep Neural Network

Aerial Scene Classification Via Multilevel Fusion Based on Deep Convolutional Neural Networks

Boosting 3D Point Cloud Registration by Transferring Multi-modality Knowledge

Point Cloud Deep Learning Network Based on Local Domain Multi-Level Feature

A Late Fusion Approach for Harnessing Multi-CNN Model High-Level Features

Deep Corner

Dual-Neighborhood Deep Fusion Network for Point Cloud Analysis

DMFF: dual-way multimodal feature fusion for 3D object detection

Learning local descriptors with multi-level feature aggregation and spatial context pyramid

Dual-Branch Feature Fusion Network Based Cross-Modal Enhanced CNN and Transformer for Hyperspectral and LiDAR Classification

DMN4: Few-shot Learning Via Discriminative Mutual Nearest Neighbor Neural Network

AFSRNet: learning local descriptors with adaptive multi-scale feature fusion and symmetric regularization

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

MFFNet: Multimodal Feature Fusion Network for Point Cloud Semantic Segmentation

Large-Margin Multi-Modal Deep Learning for RGB-D Object Recognition

Adaptive and azimuth-aware fusion network of multimodal local features for 3D object detection

A Feature Embedding Strategy for High-Level CNN Representations from Multiple ConvNets

Multiscale Feature Interactive Network for Multifocus Image Fusion

Interactive Fusion and Correlation Network for Three-Modal Images Few-Shot Semantic Segmentation

Deep Feature Correlation Learning for Multi-Modal Remote Sensing Image Registration

Fusing features of deep convolution neural networks to achieve the scene classification of remote sensing image