Abstract: While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that the standard full-image CNN features are suboptimal for scene classification. In this article, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities of RGB, HHA, and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity: only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity: features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. We further apply our feature fusion framework to an action recognition task to demonstrate that it generalizes to other multimodal, well-structured features. In particular, for action recognition, we enforce interpart sparsity to choose more discriminative body parts, and intermodal nonsparsity to make informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also effective for action recognition, achieving performance competitive with the state of the art.
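The two regularization terms named in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical weight tensor `W` of shape `(M, C, D)` (modalities, GMM components, weights per block) and uses one common textbook form of each penalty; the function names, the tensor layout, and the exact grouping are assumptions for illustration, and the article's actual formulation may differ.

```python
import numpy as np

def group_lasso(W):
    """Group lasso with one group per GMM component (assumed layout).

    W has shape (M, C, D): M modalities, C components, D weights per
    (modality, component) block. Returns sum_c ||W[:, c, :]||_2.
    """
    # Collect each component's weights across all modalities: shape (C, M*D)
    per_component = np.transpose(W, (1, 0, 2)).reshape(W.shape[1], -1)
    return float(np.linalg.norm(per_component, axis=1).sum())

def exclusive_group_lasso(W):
    """Exclusive group lasso with one group per modality (assumed layout).

    Returns sum_m (sum_c ||W[m, c, :]||_2)^2: an L1-style sum inside each
    modality, squared, then summed over modalities.
    """
    # L2 norm of every (modality, component) block: shape (M, C)
    block_norms = np.linalg.norm(W, axis=2)
    return float((block_norms.sum(axis=1) ** 2).sum())

# Two active blocks placed in different ways (toy values, not real data).
same_component = np.zeros((2, 3, 2))
same_component[0, 0] = [3.0, 4.0]   # modality 0, component 0
same_component[1, 0] = [3.0, 4.0]   # modality 1, component 0

same_modality = np.zeros((2, 3, 2))
same_modality[0, 0] = [3.0, 4.0]    # modality 0, component 0
same_modality[0, 1] = [3.0, 4.0]    # modality 0, component 1

# Group lasso is smaller when weight concentrates on few components.
print(group_lasso(same_component), group_lasso(same_modality))
# Exclusive group lasso is smaller when weight spreads across modalities.
print(exclusive_group_lasso(same_component), exclusive_group_lasso(same_modality))
```

In this toy setting the first penalty favors concentrating weight on one component shared by both modalities, while the second favors spreading weight over both modalities, which mirrors the component sparsity and modal nonsparsity postulates.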

Multi-Modal Unsupervised Feature Learning for RGB-D Scene Labeling

Unsupervised Joint Feature Learning and Encoding for RGB-D Scene Labeling

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

MMSS: Multi-modal Sharable and Specific Feature Learning for RGB-D Object Recognition

Large-Margin Multi-Modal Deep Learning for RGB-D Object Recognition

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

Multimodal Recurrent Neural Networks with Information Transfer Layers for Indoor Scene Labeling

Unsupervised Multimodal Feature Learning for Semantic Image Segmentation

RGB×D: Learning Depth-Weighted RGB Patches for RGB-D Indoor Semantic Segmentation

Semi-supervised Learning for RGB-D Object Recognition

Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks

Pedestrian detection with unsupervised multispectral feature learning using deep neural networks

Multi-type and Multi-level Feature Fusion Network for RGBD Indoor Semantic Segmentation

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

A Multi-Modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling

Modality and Component Aware Feature Fusion for RGB-D Scene Classification

Multi-modal Deep Feature Learning for RGB-D Object Detection

Weakly-supervised DCNN for RGB-D Object Recognition in Real-World Applications Which Lack Large-scale Annotated Training Data

RoMo: Robust Unsupervised Multimodal Learning with Noisy Pseudo Labels

Unsupervised Feature Learning for RGB-D Image Classification

LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling