Abstract: While convolutional neural networks (CNNs) have proven excellent for object recognition, the greater spatial variability in scene images typically means that standard full-image CNN features are suboptimal for scene classification. In this article, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities (RGB, HHA, and surface normals) extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity, whereby only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity, whereby features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. We further apply our feature fusion framework to an action recognition task to demonstrate that it generalizes to other multimodal, well-structured features. In particular, for action recognition, we enforce interpart sparsity to select the more discriminative body parts, and intermodal nonsparsity so that informative features from both the appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also effective for action recognition, achieving performance competitive with the state of the art.
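As a rough illustration of the two regularization terms named in the abstract, the following sketch computes a group lasso penalty over GMM components and an exclusive group lasso penalty across modalities using their standard textbook definitions. The weight layout `(n_modalities, n_components, dim)`, the function names, and the exact grouping are assumptions made for illustration only; the abstract does not specify the formulation actually used.

```python
import numpy as np

def group_lasso(W):
    """Sum of L2 norms, one group per GMM component.

    W is a hypothetical weight tensor of shape
    (n_modalities, n_components, dim). Each component's weights
    (across all modalities) form one group, so whole components
    can be driven to zero (component sparsity).
    """
    per_component = np.sqrt((W ** 2).sum(axis=(0, 2)))
    return per_component.sum()

def exclusive_group_lasso(W):
    """Sum over modalities of the squared L1 norm of block norms.

    Within a modality, the L1 term makes components compete; across
    modalities, the squared (L2-like) term spreads weight instead of
    zeroing out a whole modality (modal nonsparsity). This grouping
    is an assumption, not taken from the source.
    """
    block_norms = np.sqrt((W ** 2).sum(axis=2))  # shape (M, K)
    return (block_norms.sum(axis=1) ** 2).sum()

# Two weight tensors with identical blocks, placed differently.
spread = np.zeros((2, 3, 4))
spread[0, 0, :] = 1.0            # modality 0, component 0, norm 2
spread[1, 1, :2] = [3.0, 4.0]    # modality 1, component 1, norm 5

concentrated = np.zeros((2, 3, 4))
concentrated[0, 0, :] = 1.0      # both blocks in modality 0
concentrated[0, 1, :2] = [3.0, 4.0]

print(group_lasso(spread), group_lasso(concentrated))
print(exclusive_group_lasso(spread), exclusive_group_lasso(concentrated))
```

In this toy case both tensors incur the same group lasso penalty, while the exclusive group lasso penalty is lower when the weight is spread over both modalities, which is the behavior the modal nonsparsity postulate asks for.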

Mitigating imbalances in heterogeneous feature fusion for multi-class 6D pose estimation

MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

MixedFusion: 6D Object Pose Estimation from Decoupled RGB-Depth Features

Recurrent Volume-Based 3-D Feature Fusion for Real-Time Multiview Object Pose Estimation

3D Point-to-Keypoint Voting Network for 6D Pose Estimation

A Lightweight Color and Geometry Feature Extraction and Fusion Module for End-to-end 6D Pose Estimation

A Transformer-based multi-modal fusion network for 6D pose estimation

MFF-Net: Towards Efficient Monocular Depth Completion With Multi-Modal Feature Fusion

PA-Pose: Partial Point Cloud Fusion Based on Reliable Alignment for 6D Pose Tracking

FEIF: Feature Excitation and Interactive Fusion for 6D Object Pose Estimation

Robust Classification and 6D Pose Estimation by Sensor Dual Fusion of Image and Point Cloud Data

LHFF-Net: Local heterogeneous feature fusion network for 6DoF pose estimation

A modal fusion network with dual attention mechanism for 6D pose estimation

DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion

A Pose Estimation Algorithm for Multimodal Data Fusion

Estimating 6D Object Poses with Temporal Motion Reasoning for Robot Grasping in Cluttered Scenes

Towards Two-view 6D Object Pose Estimation: A Comparative Study on Fusion Strategy

HFF6D: Hierarchical Feature Fusion Network for Robust 6D Object Pose Tracking

FusionFormer: A Concise Unified Feature Fusion Transformer for 3D Pose Estimation

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond