Abstract:While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that the standard full-image CNNfeatures are suboptimal for scene classification. In this article, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities of RGB, HHA, and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity-that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity-that features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we are able to achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. Moreover, we further apply our feature fusion framework on an action recognition task to demonstrate that our framework can be generalized for other multimodal well-structured features. In particular, for action recognition, we enforce interpart sparsity to choose more discriminative body parts, and intermodal nonsparsity to make informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also very effective for action recognition, achieving very competitive performance compared with the state of the art.

Online video visual relation detection with hierarchical multi-modal fusion

Video Visual Relation Detection Via Multi-modal Feature Fusion

Visual relationship detection with a deep convolutional relationship network

Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization

Multimodal Relation Extraction via a Mixture of Hierarchical Visual Context Learners

Visual Relationship Detection with Multimodal Fusion and Reasoning

Video Relation Detection with Spatio-Temporal Graph

TSVFN: Two-Stage Visual Fusion Network for multimodal relation extraction

Deep Relationship Analysis in Video with Multimodal Feature Fusion

Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Multi-view scene matching with relation aware feature perception

Bi-level Dynamic Learning for Jointly Multi-modality Image Fusion and Beyond

Video Relation Detection Via Multiple Hypothesis Association.

A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

A Multimodal Approach for Multiple-Relation Extraction in Videos

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection

HCMS: Hierarchical and Conditional Modality Selection for Efficient Video Recognition

ACF-Net: Asymmetric Cascade Fusion for 3D Detection with LiDAR Point Clouds and Images

A multi-modal fusion approach for measuring web video relatedness