Fusion-Mamba for Cross-modality Object Detection

Wenhao Dong, Haodong Zhu, Shaohui Lin, Xiaoyan Luo, Yunhang Shen, Xuhui Liu, Juan Zhang, Guodong Guo, Baochang Zhang
2024-04-14
Abstract: Cross-modality fusing complementary information from different modalities effectively improves object detection performance, making it more useful and robust for a wider range of applications. Existing fusion strategies combine different types of images or merge different backbone features through elaborated neural network modules. However, these methods neglect that modality disparities affect cross-modality fusion performance, as different modalities with different camera focal lengths, placements, and angles are hardly fused. In this paper, we investigate cross-modality fusion by associating cross-modal features in a hidden state space based on an improved Mamba with a gating mechanism. We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction, thereby reducing disparities between cross-modal features and enhancing the representation consistency of fused features. FMB contains two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) enables deep fusion in a hidden state space. Through extensive experiments on public datasets, our proposed approach outperforms the state-of-the-art methods on $m$AP with 5.9% on $M^3FD$ and 4.9% on FLIR-Aligned datasets, demonstrating superior object detection performance. To the best of our knowledge, this is the first work to explore the potential of Mamba for cross-modal fusion and establish a new baseline for cross-modality object detection.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses feature fusion in cross-modality object detection, in particular the fusion of infrared (IR) and visible-light (RGB) images. It argues that existing fusion strategies overlook disparities between modalities, such as differences in camera focal length, placement, and shooting angle, and that these disparities degrade fusion performance. To address this, the authors propose Fusion-Mamba, which uses an improved Mamba with a gating mechanism to associate features from the two modalities in a hidden state space. The core component is the Fusion-Mamba block (FMB), which maps cross-modal features into that hidden state space for interaction, reducing the disparities between them and making the fused representation more consistent. FMB contains two modules: the State Space Channel Swapping (SSCS) module for shallow feature fusion, and the Dual State Space Fusion (DSSF) module for deep fusion in the hidden state space. In experiments on public IR-RGB datasets including LLVIP, $M^3FD$, and FLIR-Aligned, Fusion-Mamba outperforms existing methods, with reported mAP gains of 5.9% on $M^3FD$ and 4.9% on FLIR-Aligned. In summary, the paper aims to make cross-modal feature fusion more effective and consistent in order to improve detection performance.
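To make the two fusion ideas more concrete, the following is a toy sketch in plain Python, not the authors' implementation. It leaves out the Mamba state space model entirely and only illustrates (a) exchanging channels between two modality feature maps and (b) blending two feature maps with a gate. The half-and-half channel split, the sigmoid gate, and the names `swap_channels` and `gated_fuse` are assumptions made for illustration; the summary above does not specify these details.

```python
import math


def swap_channels(rgb, ir):
    """Exchange the second half of the channels between two feature maps.

    Each feature map is a list of channels; each channel is a flat list of
    floats. The half-and-half split is an assumption for illustration only.
    """
    if len(rgb) != len(ir):
        raise ValueError("both modalities need the same number of channels")
    half = len(rgb) // 2
    rgb_mixed = rgb[:half] + ir[half:]
    ir_mixed = ir[:half] + rgb[half:]
    return rgb_mixed, ir_mixed


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(local, other, gate_logits):
    """Blend two same-shaped feature maps with a per-element sigmoid gate.

    A gate value near 1 keeps the local feature; a value near 0 takes the
    feature from the other modality. In a real model the gate logits would be
    learned; here they are passed in directly (hypothetical simplification).
    """
    fused = []
    for loc_ch, oth_ch, gate_ch in zip(local, other, gate_logits):
        fused.append([
            sigmoid(g) * a + (1.0 - sigmoid(g)) * b
            for a, b, g in zip(loc_ch, oth_ch, gate_ch)
        ])
    return fused


# Four channels with two spatial positions each (toy values).
rgb = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
ir = [[-1.0, -1.0], [-2.0, -2.0], [-3.0, -3.0], [-4.0, -4.0]]

rgb_mixed, ir_mixed = swap_channels(rgb, ir)

# Zero logits give a gate of 0.5 everywhere, i.e. a plain average.
zero_gate = [[0.0, 0.0] for _ in rgb]
fused = gated_fuse(rgb_mixed, ir_mixed, zero_gate)
```

In this sketch the swap gives each branch some channels from the other modality before any deeper interaction, and the gate decides per element how much of each branch survives in the fused output. In the paper both steps are described as operating on features mapped into a hidden state space, which this toy version does not model.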