Abstract: The use of complementary information, namely depth or thermal information, has shown its benefits for salient object detection (SOD) in recent years. However, the RGB-D and RGB-T SOD problems are currently solved only independently, and most existing methods directly extract and fuse raw features from backbones. Such methods can be easily restricted by low-quality modality data and redundant cross-modal features. In this work, a unified end-to-end framework is designed to simultaneously analyze RGB-D and RGB-T SOD tasks. Specifically, to effectively tackle multi-modal features, we propose a novel multi-stage and multi-scale fusion network (MMNet), which consists of a cross-modal multi-stage fusion module (CMFM) and a bi-directional multi-scale decoder (BMD). Similar to the visual color stage doctrine in the human visual system (HVS), the proposed CMFM aims to explore important feature representations in the feature response stage and integrate them into cross-modal features in the adversarial combination stage. Moreover, the proposed BMD learns the combination of multi-level cross-modal fused features to capture both local and global information of salient objects, and can further boost multi-modal SOD performance. The proposed unified cross-modality feature analysis framework, based on two-stage and multi-scale information fusion, can be used for diverse multi-modal SOD tasks. Comprehensive experiments (approximately 92K image pairs) demonstrate that the proposed method consistently outperforms 21 other state-of-the-art methods on nine benchmark datasets. This validates that the proposed method works well on diverse multi-modal SOD tasks with good generalization and robustness, and provides a good multi-modal SOD benchmark.
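The two-stage fusion idea in the abstract can be illustrated with a toy sketch. This is not the authors' CMFM: the abstract gives no equations, so the sigmoid channel gate, the softmax competition between modalities, and all function names below are assumptions chosen only to show the shape of a "respond, then combine" pipeline. Each modality is represented as a list of channels, and each channel as a flat list of activations.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def feature_response(channels):
    """Stage 1 (assumed form): gate each channel by a sigmoid of its mean activation."""
    gated = []
    for ch in channels:
        gate = sigmoid(sum(ch) / len(ch))
        gated.append([gate * v for v in ch])
    return gated


def competitive_combination(rgb, aux):
    """Stage 2 (assumed form): per channel, the two modalities compete via a
    softmax over their mean absolute activation; the weights sum to 1."""
    fused = []
    for r, a in zip(rgb, aux):
        energy_r = sum(abs(v) for v in r) / len(r)
        energy_a = sum(abs(v) for v in a) / len(a)
        peak = max(energy_r, energy_a)  # subtract the max for numerical stability
        w_r = math.exp(energy_r - peak)
        w_a = math.exp(energy_a - peak)
        total = w_r + w_a
        w_r, w_a = w_r / total, w_a / total
        fused.append([w_r * x + w_a * y for x, y in zip(r, a)])
    return fused


def fuse(rgb, aux):
    """Hypothetical two-stage fusion: respond per modality, then combine."""
    return competitive_combination(feature_response(rgb), feature_response(aux))


if __name__ == "__main__":
    rgb = [[1.0, 2.0], [0.0, 0.0]]    # channel 0 informative, channel 1 empty
    depth = [[0.0, 0.0], [3.0, 1.0]]  # the opposite pattern
    print(fuse(rgb, depth))
```

In this sketch a channel that is empty in one modality (for example, a low-quality depth map) receives the smaller weight, so the fused channel stays closer to the informative modality. A learned network would replace both hand-written rules with trained layers.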

Visual and textual based multimodal document object detection

M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis

HiM: hierarchical multimodal network for document layout analysis

Multi-Modal Object Detection Method Based on Dual-Branch Asymmetric Attention Backbone and Feature Fusion Pyramid Network

MMDR: A Result Feature Fusion Object Detection Approach for Autonomous System

Weakly Aligned Feature Fusion for Multimodal Object Detection

Cross-Modality 3D Object Detection

Multimodal Feature Fusion YOLOv5 for RGB-T Object Detection

Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection

Multimodal Deep Representation Learning for Video Classification

An object detection algorithm based on infrared-visible dual modal feature fusion

MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection

DMFF: dual-way multimodal feature fusion for 3D object detection

Multi-Sem Fusion: Multimodal Semantic Fusion for 3-D Object Detection

Weakly Paired Multimodal Fusion for Object Recognition

Document Image Object Detection Algorithm Based on Transformer and Mixed-MLP Network

Frustum FusionNet: Amodal 3D Object Detection with Multi-Modal Feature Fusion

Feature Combination Based On Receptive Fields And Cross-Fusion Feature Pyramid For Object Detection

Multispectral Object Detection Based on Multilevel Feature Fusion and Dual Feature Modulation

MFIL-FCOS: A Multi-Scale Fusion and Interactive Learning Method for 2D Object Detection and Remote Sensing Image Detection