Abstract: While convolutional neural networks (CNNs) have excelled at object recognition, the greater spatial variability in scene images typically means that standard full-image CNN features are suboptimal for scene classification. In this article, we investigate a framework that allows greater spatial flexibility: instead of full-image features, it considers the Fisher vector (FV)-encoded distribution of local CNN features obtained from a multitude of region proposals per image. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities (RGB, HHA, and surface normals) extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity, that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity, that features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these postulates are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. We further apply our feature fusion framework to an action recognition task to demonstrate that it generalizes to other multimodal, well-structured features. In particular, for action recognition, we enforce interpart sparsity to select the more discriminative body parts, and intermodal nonsparsity so that informative features from both the appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also effective for action recognition, achieving performance competitive with the state of the art.
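
The two regularizers named in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything below is an assumption: the weights are stored as a hypothetical nested list `W[m][k]` (modality `m`, GMM component `k`, each block a list of floats), group lasso is taken as the sum of per-component L2 norms pooled over modalities, and exclusive group lasso is taken as the commonly used form that squares, per modality, the sum of its block norms. The function names are illustrative only and are not the authors' implementation.

```python
import math


def block_norm(v):
    """L2 norm of one weight block (a flat list of floats)."""
    return math.sqrt(sum(x * x for x in v))


def group_lasso(W):
    """Assumed form: sum over GMM components of the L2 norm of that
    component's weights pooled across all modalities. Whole components
    are pushed to zero together (component sparsity)."""
    n_components = len(W[0])
    total = 0.0
    for k in range(n_components):
        pooled = [x for modality in W for x in modality[k]]
        total += block_norm(pooled)
    return total


def exclusive_group_lasso(W):
    """Assumed form: for each modality, sum the block norms (an L1 norm
    over components), square it, and add up over modalities. Squaring
    penalizes piling all weight into one modality (modal nonsparsity)."""
    total = 0.0
    for modality in W:
        l1_over_components = sum(block_norm(block) for block in modality)
        total += l1_over_components ** 2
    return total


def zero_weights(n_modalities, n_components, dim):
    """Hypothetical helper: all-zero weights W[m][k] of length dim."""
    return [[[0.0] * dim for _ in range(n_components)]
            for _ in range(n_modalities)]


# Two active blocks of norm 5, both inside modality 0.
concentrated = zero_weights(3, 4, 2)
concentrated[0][1] = [3.0, 4.0]
concentrated[0][2] = [3.0, 4.0]

# The same two blocks, spread over modalities 0 and 1.
spread = zero_weights(3, 4, 2)
spread[0][1] = [3.0, 4.0]
spread[1][2] = [3.0, 4.0]

print(group_lasso(concentrated), group_lasso(spread))
print(exclusive_group_lasso(concentrated), exclusive_group_lasso(spread))
```

Under these assumptions, the group lasso term is identical for the two toy weight sets because each uses the same two components, while the exclusive term is smaller for the spread configuration, which is the sense in which it encourages modalities to coexist.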

Convolutional Fisher Kernels for RGB-D Object Recognition

Hybrid RGB-D object recognition using Convolutional Neural Network and Fisher Vector

Improving RGB-D Face Recognition via Transfer Learning from a Pretrained 2D Network

Recurrent Convolutional Fusion for RGB-D Object Recognition

Convolutional Neural Network for 3D Object Recognition Based on RGB-D Dataset

Multimodal deep learning for robust RGB-D object recognition

Multiple Classifiers-Based Feature Fusion for RGB-D Object Recognition

RGB-D Scene Recognition Via Spatial-Related Multi-Modal Feature Learning

New RGB-D features for object recognition on kernel view

Hand-Crafted Features or Machine Learnt Features? Together They Improve RGB-D Object Recognition

Fusing Deep Features by Kernel Collaborative Representation for Remote Sensing Scene Classification

C$^{2}$DFNet: Criss-Cross Dynamic Filter Network for RGB-D Salient Object Detection

Two-Level Attention-based Fusion Learning for RGB-D Face Recognition

RGB-D Object Recognition Via Incorporating Latent Data Structure and Prior Knowledge

A Complementary Fusion Strategy for RGB-D Face Recognition

RGB-D Object Recognition Using the Knowledge Transferred from Relevant RGB Images

Confidence-Aware RGB-D Face Recognition Via Virtual Depth Synthesis

A Multi-Modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond