Abstract:While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that the standard full-image CNNfeatures are suboptimal for scene classification. In this article, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities of RGB, HHA, and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity-that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity-that features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we are able to achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. Moreover, we further apply our feature fusion framework on an action recognition task to demonstrate that our framework can be generalized for other multimodal well-structured features. In particular, for action recognition, we enforce interpart sparsity to choose more discriminative body parts, and intermodal nonsparsity to make informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also very effective for action recognition, achieving very competitive performance compared with the state of the art.

Harnessing Object and Scene Semantics for Large-Scale Video Understanding.

ObjectFusion: an Object Detection and Segmentation Framework with RGB-D SLAM and Convolutional Neural Networks

Deep Dual-Stream Network with Scale Context Selection Attention Module for Semantic Segmentation

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

SAM: Modeling Scene, Object and Action with Semantics Attention Modules for Video Recognition

Deep Fusion of Multiple Semantic Cues for Complex Event Recognition

Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification

Dense Semantics-Assisted Networks For Video Action Recognition

Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks.

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Semantic-Aware Fusion Network Based on Super-Resolution

Learning Semantic Feature Map for Visual Content Recognition

Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning

Semantic2Graph: graph-based multi-modal feature fusion for action segmentation in videos

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

Fusing Multi-Stream Deep Networks for Video Classification

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification.

FM-Fusion: Instance-Aware Semantic Mapping Boosted by Vision-Language Foundation Models