Abstract: Multimodal object detection offers a promising prospect to facilitate robust detection in various visual conditions. However, existing two-stream backbone networks are challenged by complex fusion and substantial parameter increments. This is primarily due to large data distribution biases of multimodal homogeneous information. In this paper, we propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA), with a shared backbone. The shared parameters enhance the consistency of homogeneous information, while lightweight modal adaptors focus on modality-unique features. Furthermore, we design an adaptive rank allocation strategy to adapt to the varying heterogeneity at different feature levels. When applied to two multimodal object detection datasets, experiments validate the effectiveness of our method. Notably, on DroneVehicle, LMA attains a 10.4% accuracy improvement over the state-of-the-art method with a 149M-parameter reduction. The code is available at https://github.com/zyszxhy/FoRA. Our work was submitted to ACM MM in April 2024, but was rejected. We will continue to refine our work and paper writing next, mainly including proof of theory and multi-task applications of FoRA.

What problem does this paper attempt to address?

This paper addresses the performance degradation in multimodal object detection caused by large data distribution deviations between modalities and by excessive growth in the number of parameters. Specifically:

1. **Problems of existing methods**:
   - **Two-stream backbone networks**: These methods need complex cross-modal fusion and add a large number of parameters, which leads to complicated model designs and redundant parameters.
   - **Data distribution biases**: The homogeneous information in different modalities (such as visible-light and infrared images) follows markedly different distributions, so the extracted features are not consistent enough, which hurts detection performance.
2. **Proposed methods**:
   - **Low-rank Modal Adaptors (LMA)**: The authors propose a new multimodal object detector, LMA. It shares one backbone network across modalities to reduce the distribution deviation of homogeneous information, and uses lightweight modal adaptors to extract heterogeneous, modality-specific information.
   - **Adaptive rank allocation strategy**: To adapt to the varying heterogeneity of different feature layers, an importance-aware training strategy dynamically allocates the rank of each adaptor matrix, achieving better performance under a limited parameter budget.
3. **Main contributions**:
   - A new multimodal object detector, LMA, that combines a shared backbone network with lightweight modal adaptors. It reduces the distribution deviation of multimodal features and improves detection performance.
   - An adaptive matrix rank allocation strategy that adjusts the number of parameters according to the importance of each feature layer, further optimizing the computational cost and performance of the model.
   - Experiments on two multimodal object detection benchmark datasets, DroneVehicle and LLVIP. The results show that LMA improves detection accuracy (for example, mAP@0.5 increases by 10.4% on DroneVehicle) while cutting the parameter increment by a factor of about 1000 compared with existing methods.

Through these improvements, LMA addresses both problems and offers a more efficient and lightweight solution for multimodal object detection.
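The parameter argument and the rank allocation idea summarized above can be illustrated with a small sketch. It is not the authors' implementation: it assumes each adaptor is a pair of low-rank factors of shapes `d_out x r` and `r x d_in` (the usual low-rank adaptation form), and it swaps in a simple global top-k cut over per-component importance scores in place of the paper's importance-aware training strategy. The function names, the layer names, and the scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def two_stream_params(d_in: int, d_out: int, n_layers: int, n_modalities: int = 2) -> int:
    """Weight count when every modality owns a full copy of each layer."""
    return n_modalities * n_layers * d_in * d_out


def shared_plus_adaptor_params(
    d_in: int, d_out: int, ranks: List[int], n_modalities: int = 2
) -> int:
    """Weight count for one shared backbone plus one low-rank adaptor
    per modality and layer.

    Assumption: an adaptor of rank r is two factors, B (d_out x r) and
    A (r x d_in), so it costs r * (d_in + d_out) weights.
    """
    shared = len(ranks) * d_in * d_out
    adaptors = n_modalities * sum(r * (d_in + d_out) for r in ranks)
    return shared + adaptors


def allocate_ranks(importance: Dict[str, List[float]], budget: int) -> Dict[str, int]:
    """Keep the `budget` highest-scoring rank components across all layers.

    `importance[layer][i]` is a hypothetical score for the i-th rank
    component of that layer's adaptor. A layer's rank is the number of its
    components that survive the global cut.
    """
    scored: List[Tuple[float, str]] = [
        (score, layer) for layer, scores in importance.items() for score in scores
    ]
    scored.sort(key=lambda item: item[0], reverse=True)  # highest score first
    ranks = {layer: 0 for layer in importance}
    for _, layer in scored[:budget]:
        ranks[layer] += 1
    return ranks


if __name__ == "__main__":
    # Made-up importance scores for three feature levels.
    scores = {
        "stage1": [0.9, 0.1, 0.05, 0.01],
        "stage2": [0.8, 0.7, 0.2, 0.02],
        "stage3": [0.6, 0.5, 0.4, 0.3],
    }
    ranks = allocate_ranks(scores, budget=6)
    print(ranks)
    print(two_stream_params(256, 256, n_layers=3))
    print(shared_plus_adaptor_params(256, 256, list(ranks.values())))
```

With these toy numbers, the layers whose components score higher receive larger ranks, and the adaptors add only a few thousand weights on top of the shared backbone, whereas a second full stream would double it. The real gap depends on the backbone and on the ranks chosen during training.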

FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network

Contribution-Based Multi-Stream Feature Distance Fusion Method With $k$-Distribution Re-Ranking for Person Re-Identification

Low-Rank Multimodal Remote Sensing Object Detection With Frequency Filtering Experts

Low-Rank Adaption on Transformer-based Oriented Object Detector for Satellite Onboard Processing of Remote Sensing Images

CMIFDF: A lightweight cross-modal image fusion and weight-sharing object detection network framework

Learning Adaptive Fusion Bank for Multi-modal Salient Object Detection

Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Multi-Branch Auxiliary Fusion YOLO with Re-parameterization Heterogeneous Convolutional for accurate object detection

MLFA: Toward Realistic Test Time Adaptive Object Detection by Multi-Level Feature Alignment

Mixture-of-Subspaces in Low-Rank Adaptation

FA-YOLO: Research On Efficient Feature Selection YOLO Improved Algorithm Based On FMDS and AGMF Modules

AFTR: A Robustness Multi-Sensor Fusion Model for 3D Object Detection Based on Adaptive Fusion Transformer

ACDF-YOLO: Attentive and Cross-Differential Fusion Network for Multimodal Remote Sensing Object Detection

MOS: A Low Latency and Lightweight Framework for Face Detection, Landmark Localization, and Head Pose Estimation

MSFFAL: Few-Shot Object Detection via Multi-Scale Feature Fusion and Attentive Learning

Cross-Modal Object Tracking via Modality-Aware Fusion Network and a Large-Scale Dataset

FLoCoRA: Federated learning compression with low-rank adaptation

A Task-Balanced Multiscale Adaptive Fusion Network for Object Detection in Remote Sensing Images

AFANet: A Multibackbone Compatible Feature Fusion Framework for Effective Remote Sensing Object Detection

AFD-Net: Adaptive Fully-Dual Network for Few-Shot Object Detection

Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning