Infrared-visible Image Object Detection Algorithm Using Feature Dynamic Selection
Xu Ke, Liu Xinpu, Wang Hanyun, Wan Jianwei, Guo Yulan
DOI: https://doi.org/10.11834/jig.230495
2024-01-01
Journal of Image and Graphics
Abstract: Objective In recent years, object detection algorithms that fuse visible and infrared dual-modal images have received considerable attention. Such algorithms are an effective approach to object detection in complex scenes. The object detection pipeline can be roughly divided into three stages. The first stage is feature extraction, which aims to extract geometric features from the input data. Next, the extracted features are fed into the neck network for multi-scale feature fusion. Finally, the fused features are input into the detection network to output object detection results. Dual-modal detection algorithms follow the same process to achieve object localization and classification. The difference is that traditional object detection focuses on single-modal visible images, whereas dual-modal detection uses both visible and infrared image data. A dual-modal detection algorithm aims to exploit information from infrared and visible images simultaneously. It merges these images to obtain more comprehensive and accurate target information, which enhances the accuracy and robustness of object detection. Traditional fusion methods encompass pixel-level fusion and feature-level fusion. Pixel-level fusion applies a straightforward weighted overlay to the two types of images, which enhances the contrast and edge information of the targets. Feature-level fusion extracts features from the infrared and visible images and combines them to enhance the representation capability of the targets. However, the feature fusion process of existing dual-modal detection algorithms faces two major issues. First, the feature fusion methods employed are relatively simple, involving only element-wise addition or concatenation of features. Consequently, these methods yield unsatisfactory fusion results that limit the performance of subsequent object detection. Second, the
algorithm structure focuses solely on the feature fusion process and neglects the crucial feature selection process. This deficiency results in the inefficient utilization of valuable features.
Method In this study, we introduce a visible and infrared image fusion object detection algorithm that employs dynamic feature selection to address the two issues mentioned above. Overall, we enhance the conventional YOLOv5 detector through modifications to its backbone, neck, and detection head components. We select CSPDarkNet53 as the backbone, with identical structures for the visible and infrared image branches. The algorithm incorporates two innovative modules: the dynamic fusion layer and the dynamic selection layer. The dynamic fusion layer is embedded in the backbone network and uses the Transformer structure to fuse the feature maps of the two image sources multiple times, which enriches feature expression. The dynamic selection layer is placed in the neck network and uses three attention mechanisms (i.e., scale, space, and channel) to improve multi-scale feature maps and screen useful features. These mechanisms are implemented with SENet and deformable convolutions. In line with standard practices in object detection algorithms, we utilize the detection head of YOLOv5 to generate detection results. The loss function employed for algorithm training is the sum of bounding box regression loss, classification loss, and confidence loss, which are implemented with generalized intersection over union (GIoU), cross entropy, and squared-error functions, respectively.
Result In this study, we validate our proposed algorithm through experimental evaluation on three publicly available datasets: FLIR, visible-infrared paired dataset for low-light vision (LLVIP), and vehicle detection in aerial imagery (VEDAI). We use the mean average precision (mAP) for evaluation. Compared with the baseline model that adds features element-wise, our algorithm
achieves improvements of 1.3%, 0.6%, and 3.9% in mAP50 scores and 4.6%, 2.6%, and 7.5% in mAP75 scores. In addition, our algorithm demonstrates enhancements of 3.2%, 2.1%, and 3.1% in mAP scores on the respective datasets, which effectively reduces the probability of object omission and false alarms. Moreover, we conduct ablation experiments on the two innovative modules: the dynamic fusion layer and the dynamic selection layer. The complete algorithm model, which incorporates both layers, achieves the best performance on all three test datasets. This performance validates the effectiveness of our proposed algorithm. We also compare the model size and computational efficiency of our algorithm with those of state-of-the-art algorithms, and experiments show that our algorithm significantly improves detection performance while only slightly increasing the number of parameters and the amount of computation. Furthermore, we visualize the attention weight matrices of the three dynamic fusion layers in the backbone to better reveal the mechanism of the dynamic fusion layer. The visual analysis confirms that the dynamic fusion layer effectively integrates the feature information from visible and infrared images.
Conclusion In this study, we propose a visible and infrared image fusion-based object detection algorithm that uses a dynamic feature selection strategy. This algorithm incorporates two innovative modules: the dynamic fusion layer and the dynamic selection layer. Through extensive experiments, we demonstrate that our algorithm effectively integrates feature information from the visible and infrared image modalities, which enhances the performance of object detection. However, the proposed algorithm slightly increases computational complexity and requires pre-registration of the input visible and infrared images, which limits its use in some application scenarios. Lightweight fusion modules and algorithms capable of processing unregistered visible-infrared image pairs will be the focus of future research in the field of multimodal fusion object
detection.
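The training loss described in the Method part can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the (x1, y1, x2, y2) box format, the per-sample form of each term, and the unweighted sum are assumptions made for illustration, since the abstract only states that the three terms are summed and does not give weights.

```python
import math


def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    iou = inter / union
    return iou - (enclose - union) / enclose


def giou_loss(pred_box, target_box):
    """Bounding box regression term: 1 - GIoU, in the range [0, 2]."""
    return 1.0 - giou(pred_box, target_box)


def cross_entropy(class_probs, label):
    """Classification term: negative log-probability of the true class."""
    return -math.log(class_probs[label])


def squared_error(pred_conf, target_conf):
    """Confidence term: squared difference between predicted and target score."""
    return (pred_conf - target_conf) ** 2


def total_loss(pred_box, target_box, class_probs, label, pred_conf, target_conf):
    """Unweighted sum of the three terms (the weighting is an assumption)."""
    return (
        giou_loss(pred_box, target_box)
        + cross_entropy(class_probs, label)
        + squared_error(pred_conf, target_conf)
    )
```

Unlike plain IoU, the GIoU term still varies when the predicted and target boxes do not overlap, because the enclosing-box penalty grows as the boxes move apart.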