Abstract: While convolutional neural networks (CNNs) excel at object recognition, the greater spatial variability in scene images typically means that standard full-image CNN features are suboptimal for scene classification. In this article, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities (RGB, HHA, and surface normals) extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity, namely that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity, namely that features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. We further apply our feature fusion framework to an action recognition task to demonstrate that it generalizes to other multimodal, well-structured features. In particular, for action recognition, we enforce interpart sparsity to select the more discriminative body parts, and intermodal nonsparsity so that informative features from both the appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also effective for action recognition, achieving performance competitive with the state of the art.
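
The two regularizers named in the abstract can be illustrated with a small NumPy sketch. The abstract does not give their exact form, so this uses the standard textbook definitions as an assumption: group lasso as the sum of per-group l2 norms, and exclusive group lasso as the sum of squared per-group l1 norms. The weight layout (modality, GMM component, feature dimension), the toy sizes `M`, `K`, `D`, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Toy sizes (assumed): modalities, GMM components, dims per (modality, component) block.
M, K, D = 2, 3, 2

# Flat indices of a weight vector laid out as (modality, component, dim).
idx = np.arange(M * K * D).reshape(M, K, D)
component_groups = [idx[:, k, :].ravel() for k in range(K)]  # one group per GMM component
modality_groups = [idx[m].ravel() for m in range(M)]         # one group per modality


def group_lasso(w, groups):
    """Sum of per-group l2 norms; cheaper when few groups are active."""
    return float(sum(np.linalg.norm(w[g]) for g in groups))


def exclusive_group_lasso(w, groups):
    """Sum of squared per-group l1 norms; cheaper when mass is spread over groups."""
    return float(sum(np.abs(w[g]).sum() ** 2 for g in groups))


def make_w(entries):
    """Build a flat weight vector from {(modality, component, dim): value}."""
    w = np.zeros((M, K, D))
    for (m, k, d), v in entries.items():
        w[m, k, d] = v
    return w.ravel()


# Two weight vectors with the same total l1 mass but different support.
one_component_both_modalities = make_w({(0, 0, 0): 1.0, (1, 0, 0): 1.0})
two_components_one_modality = make_w({(0, 0, 0): 1.0, (0, 1, 0): 1.0})

for name, w in [("one component, both modalities", one_component_both_modalities),
                ("two components, one modality", two_components_one_modality)]:
    print(name,
          "| group lasso over components:", round(group_lasso(w, component_groups), 3),
          "| exclusive lasso over modalities:", exclusive_group_lasso(w, modality_groups))
```

Under these assumed definitions, the vector that uses a single GMM component in both modalities is penalized less by both terms than the vector that uses two components in a single modality, which matches the stated intent of component sparsity combined with modal nonsparsity.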

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond