Fusion-Mamba for Cross-modality Object Detection

Wenhao Dong, Haodong Zhu, Shaohui Lin, Xiaoyan Luo, Yunhang Shen, Xuhui Liu, Juan Zhang, Guodong Guo, Baochang Zhang

2024-04-14

Abstract: Fusing complementary information from different modalities effectively improves object detection performance, making it more useful and robust for a wider range of applications. Existing fusion strategies combine different types of images or merge different backbone features through elaborate neural network modules. However, these methods neglect that modality disparities affect cross-modality fusion performance, as modalities captured with different camera focal lengths, placements, and angles are difficult to fuse. In this paper, we investigate cross-modality fusion by associating cross-modal features in a hidden state space based on an improved Mamba with a gating mechanism. We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction, thereby reducing disparities between cross-modal features and enhancing the representation consistency of fused features. FMB contains two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) module enables deep fusion in a hidden state space. In extensive experiments on public datasets, our proposed approach outperforms state-of-the-art methods in $m$AP by 5.9% on the $M^3$FD dataset and 4.9% on the FLIR-Aligned dataset, demonstrating superior object detection performance. To the best of our knowledge, this is the first work to explore the potential of Mamba for cross-modal fusion and to establish a new baseline for cross-modality object detection.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses feature fusion in cross-modal object detection, specifically the fusion of infrared (IR) and visible-light (RGB) images. It argues that existing fusion strategies overlook disparities between modalities, such as differences in camera focal length, placement, and shooting angle, and that these disparities degrade cross-modal fusion performance.

To address this, the authors propose Fusion-Mamba, which uses an improved Mamba structure with a gating mechanism to associate features from the two modalities in a hidden state space. The core component is the Fusion-Mamba Block (FMB), which maps cross-modal features into the hidden state space for interaction, reducing the disparities between them and making the fused feature representation more consistent. The FMB contains two modules: the State Space Channel Swapping (SSCS) module for shallow feature fusion, and the Dual State Space Fusion (DSSF) module for deep feature fusion in the hidden state space.

Experiments on public IR-RGB datasets such as LLVIP, M3FD, and FLIR-Aligned show that Fusion-Mamba outperforms existing methods, with reported mAP gains over the state of the art of 5.9% on M3FD and 4.9% on FLIR-Aligned. In short, the paper aims to make cross-modal feature fusion more effective and consistent in order to improve detection performance.
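As a rough illustration of the two fusion stages described above, the following Python sketch shows a shallow channel swap between two modalities followed by a gated element-wise blend. It is not the authors' implementation: the function names `swap_channels` and `gated_fuse`, the alternating four-group swap pattern, and the sigmoid gate are all assumptions made for illustration, and the Mamba state-space scan itself is omitted entirely.

```python
import math


def swap_channels(rgb, ir, groups=4):
    """Exchange alternating channel groups between two modalities.

    rgb and ir are lists of channels; each channel is a flat list of floats.
    The alternating-group scheme is an illustrative assumption.
    """
    if len(rgb) != len(ir):
        raise ValueError("both modalities need the same number of channels")
    if len(rgb) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    size = len(rgb) // groups
    rgb_out, ir_out = [], []
    for g in range(groups):
        lo, hi = g * size, (g + 1) * size
        # Even groups stay in place; odd groups come from the other modality.
        src_rgb, src_ir = (rgb, ir) if g % 2 == 0 else (ir, rgb)
        rgb_out.extend(src_rgb[lo:hi])
        ir_out.extend(src_ir[lo:hi])
    return rgb_out, ir_out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(own, other, gate_logits):
    """Blend one channel with its counterpart: own + sigmoid(gate) * other.

    The gate form is a hypothetical stand-in for a learned gating mechanism.
    """
    return [a + sigmoid(z) * b for a, b, z in zip(own, other, gate_logits)]


# Toy inputs: 4 channels per modality, 2 values per channel.
rgb = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
ir = [[-1.0, -1.0], [-2.0, -2.0], [-3.0, -3.0], [-4.0, -4.0]]

rgb_mixed, ir_mixed = swap_channels(rgb, ir)
fused = gated_fuse(rgb_mixed[0], ir_mixed[0], gate_logits=[0.0, 0.0])
```

In this toy setup each modality's output keeps half of its own channels and receives the other half from its counterpart, so later layers in either branch see information from both sensors. The gate then controls how much of the complementary channel is added back.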

FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba

MambaDFuse: A Mamba-based Dual-phase Model for Multi-modality Image Fusion

Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion

MambaSOD: Dual Mamba-Driven Cross-Modal Fusion Network for RGB-D Salient Object Detection

FusionMamba: Efficient Remote Sensing Image Fusion with State Space Model

AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection

AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

MMDR: A Result Feature Fusion Object Detection Approach for Autonomous System

Mask-Guided Mamba Fusion for Drone-Based Visible-Infrared Vehicle Detection

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection

CrossFusion: Interleaving Cross-modal Complementation for Noise-resistant 3D Object Detection

MSFMamba: Multi-Scale Feature Fusion State Space Model for Multi-Source Remote Sensing Image Classification

Multi-Modal Object Detection Method Based on Dual-Branch Asymmetric Attention Backbone and Feature Fusion Pyramid Network

Coupled Mamba: Enhanced Multi-modal Fusion with Coupled State Space Model

Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection

Learning Adaptive Fusion Bank for Multi-modal Salient Object Detection

DMFF: dual-way multimodal feature fusion for 3D object detection

Discriminative unimodal feature selection and fusion for RGB-D salient object detection

MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection