Abstract:

Objective In recent years, object detection algorithms that fuse visible and infrared dual-modal images have received considerable attention as an effective approach to object detection in complex scenes. An object detection algorithm can be roughly divided into three stages. The first stage is feature extraction, which aims to extract geometric features from the input data. Next, the extracted features are fed into the neck network for multi-scale feature fusion. Finally, the fused features are input into the detection network, which outputs the detection results. Dual-modal detection algorithms follow the same process to localize and classify objects. The difference is that traditional object detection works on single-modal visible images, whereas dual-modal detection works on paired visible and infrared images. A dual-modal detection algorithm uses the information in both images simultaneously, merging them to obtain more comprehensive and accurate target information and thereby improving the accuracy and robustness of detection. Traditional fusion methods comprise pixel-level fusion and feature-level fusion. Pixel-level fusion applies a straightforward weighted overlay to the two images, which enhances the contrast and edge information of the targets. Feature-level fusion extracts features from the infrared and visible images separately and combines them to strengthen the representation of the targets. However, the feature fusion process of existing dual-modal detection algorithms has two major issues. First, the fusion methods are relatively simple, typically element-wise addition or concatenation of features. These methods yield unsatisfactory fusion results, which limits the performance of the subsequent detection stage. Second, the algorithm structure focuses solely on feature fusion and neglects the crucial step of feature selection, so valuable features are used inefficiently.

Method In this study, we introduce a visible and infrared image fusion object detection algorithm that uses dynamic feature selection to address these two issues. Overall, we enhance the conventional YOLOv5 detector by modifying its backbone, neck, and detection head. We select CSPDarkNet53 as the backbone and use an identical structure for the visible and infrared image branches. The algorithm incorporates two new modules: a dynamic fusion layer and a dynamic selection layer. The dynamic fusion layer is embedded in the backbone network and uses a Transformer structure to fuse the feature maps of the two image sources at multiple stages, which enriches feature expression. The dynamic selection layer is placed in the neck network and uses three attention mechanisms (scale, space, and channel) to refine the multi-scale feature maps and select useful features. These mechanisms are implemented with SENet and deformable convolutions. In line with standard practice in object detection algorithms, we use the detection head of YOLOv5 to generate the detection results. The training loss is the sum of a bounding box regression loss, a classification loss, and a confidence loss, which are implemented with generalized intersection over union (GIoU), cross entropy, and squared-error functions, respectively.

Result We validate the proposed algorithm experimentally on three publicly available datasets: FLIR, the visible-infrared paired dataset for low-light vision (LLVIP), and vehicle detection in aerial imagery (VEDAI). We use the mean average precision (mAP) for evaluation. Compared with the baseline model that fuses features by element-wise addition, our algorithm improves mAP50 by 1.3%, 0.6%, and 3.9% and mAP75 by 4.6%, 2.6%, and 7.5% on the three datasets, respectively. It also improves mAP by 3.2%, 2.1%, and 3.1% on the respective datasets, which effectively reduces missed detections and false alarms. Moreover, we conduct ablation experiments on the two new modules, the dynamic fusion layer and the dynamic selection layer. The complete model, which incorporates both layers, achieves the best performance on all three test datasets, which validates the effectiveness of the proposed algorithm. We also compare the model size and computational efficiency of our algorithm with those of state-of-the-art algorithms; the experiments show that our algorithm significantly improves detection performance at the cost of a slight increase in parameters and computation. Furthermore, we visualize the attention weight matrices of the three dynamic fusion layers in the backbone to better reveal the mechanism of the dynamic fusion layer. The visual analysis confirms that the dynamic fusion layer effectively integrates the feature information from the visible and infrared images.

Conclusion In this study, we propose a visible and infrared image fusion-based object detection algorithm that uses a dynamic feature selection strategy. The algorithm incorporates two new modules: a dynamic fusion layer and a dynamic selection layer. Through extensive experiments, we demonstrate that the algorithm effectively integrates feature information from the visible and infrared modalities, which improves object detection performance. However, the proposed algorithm slightly increases computational complexity and requires the input visible and infrared images to be registered in advance, which limits its application scenarios. Lightweight fusion modules and algorithms capable of processing unregistered visible and infrared images will be the focus of future research on multimodal fusion object detection.
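
The training loss named in the Method part of the abstract (a sum of a GIoU-based box regression term, a cross-entropy classification term, and a squared-error confidence term) can be illustrated with a small pure-Python sketch. This is not the authors' implementation. The function names, the (x1, y1, x2, y2) box format, the use of 1 - GIoU as the regression term, and the unit weights are all assumptions made for illustration; the abstract states only which three functions are summed.

```python
import math


def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area. The value lies in (-1, 1].
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    if area_a <= 0 or area_b <= 0:
        raise ValueError("boxes must have positive area")

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box that encloses both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    # GIoU penalizes the part of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


def cross_entropy(class_probs, true_class):
    """Cross entropy for one sample, given predicted class probabilities."""
    return -math.log(class_probs[true_class])


def squared_error(pred, target):
    """Squared error between a predicted and a target confidence score."""
    return (pred - target) ** 2


def detection_loss(pred_box, true_box, class_probs, true_class,
                   pred_conf, true_conf,
                   w_box=1.0, w_cls=1.0, w_conf=1.0):
    """Sum of the three terms; the weights are hypothetical (default 1.0)."""
    box_term = 1.0 - giou(pred_box, true_box)  # assumed form of the GIoU loss
    cls_term = cross_entropy(class_probs, true_class)
    conf_term = squared_error(pred_conf, true_conf)
    return w_box * box_term + w_cls * cls_term + w_conf * conf_term


if __name__ == "__main__":
    # A perfectly localized box with an uncertain class and confidence.
    loss = detection_loss((0, 0, 2, 2), (0, 0, 2, 2),
                          [0.5, 0.5], 0, 0.5, 1.0)
    print(f"loss = {loss:.4f}")
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: two disjoint boxes still receive a value that depends on how far apart they are, so the regression term provides a training signal even when the predicted box misses the target entirely.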

Infrared-visible Image Object Detection Algorithm Using Feature Dynamic Selection
