Object Detection Algorithm Based on CNN-Transformer Dual Modal Feature Fusion
Yang Chen, Hou Zhiqiang, Li Xinyue, Ma Sugang, Yang Xiaobao
DOI: https://doi.org/10.3788/gzxb20245303.0310001
IF: 0.6
2024-01-01
ACTA PHOTONICA SINICA
Abstract: To overcome the limitations of single-modal object detection, this study proposes a dual-modal feature fusion object detection algorithm based on CNN-Transformer. Infrared images provide clear contour information and visible light images provide rich detail information; integrating the complementary information of the two modalities significantly enhances object detection performance and extends its applicability to more complex real-world scenarios. The innovation of the proposed algorithm lies in a dual-stream feature extraction network that processes infrared and visible light images simultaneously. Because the clear contours in infrared images can guide object localization, we adopt a CNN-based Feature Extraction (CFE) module to process infrared images, which better captures the location information of the object and improves the expressive power of the features. Visible light images, on the other hand, usually contain rich detail information such as color and texture that is distributed across the whole image, so we adopt a Transformer-based Feature Extraction (TFE) module to process them, which better captures their global context and detail information. This differentiated feature extraction strategy exploits the advantages of both the infrared and visible light modalities, enabling the algorithm to adapt to object detection tasks in different scenes and conditions. In addition, we introduce a dual-modal feature fusion module that fuses feature information from both modalities through inter-modal information interaction. This module not only preserves the features of the two original modalities but also makes them complement each other, which strengthens the representation of object features and further improves the performance of multimodal object detection. To validate the effectiveness of the algorithm, we conducted extensive experimental evaluations on three datasets: the publicly available KAIST and FLIR datasets and the GIR dataset that we created in-house. These datasets provide infrared images, visible light images, and infrared-visible light image pairs. We trained and tested on these multimodal images to evaluate the applicability and performance of the algorithm in various situations. The experimental results indicate that, on the KAIST dataset, the detection accuracy of the proposed algorithm for dual-modal images is 5.7% and 17.4% higher than that of the baseline algorithm for infrared and visible light images, respectively. Similarly, on the FLIR dataset, the detection accuracy is 11.6% and 17.1% higher than that of the baseline algorithm for infrared and visible light images, respectively. On our self-constructed GIR dataset, the proposed algorithm also demonstrates a notable enhancement in detection accuracy. Additionally, the algorithm can process either infrared or visible light images independently, with a significant improvement in detection accuracy over the baseline algorithm for both modalities. This further validates the applicability and robustness of the proposed algorithm. Moreover, the proposed algorithm can display the detection results on either the visible or the infrared image during visualization to meet different needs.
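The dual-stream idea described in the abstract can be illustrated with a toy sketch. This is not the paper's CFE, TFE, or fusion module, whose layers the abstract does not specify. It assumes 1-D signals in place of images, a fixed edge filter standing in for the CNN branch, unscaled dot-product self-attention with identity projections standing in for the Transformer branch, and a hypothetical sigmoid-gated fusion. All function names are illustrative.

```python
import math


def local_conv_features(signal, kernel=(-1.0, 0.0, 1.0)):
    """CNN-like branch (assumed): zero-padded 1-D filter that responds to local edges."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_features(signal):
    """Transformer-like branch (assumed): each output mixes all positions (global context)."""
    out = []
    for q in signal:
        weights = softmax([q * k for k in signal])  # attention of q over every token
        out.append(sum(w * v for w, v in zip(weights, signal)))
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(ir_feat, vis_feat):
    """Hypothetical fusion: keep each modality's own features and add a gated
    contribution from the other modality, then concatenate both results."""
    ir_out = [a + sigmoid(b) * b for a, b in zip(ir_feat, vis_feat)]
    vis_out = [b + sigmoid(a) * a for a, b in zip(ir_feat, vis_feat)]
    return ir_out + vis_out


def dual_stream(ir_signal, vis_signal):
    """Run the two branches on their own modality, then fuse the features."""
    return fuse(local_conv_features(ir_signal), self_attention_features(vis_signal))


if __name__ == "__main__":
    infrared = [0.0, 0.0, 1.0, 1.0]  # a sharp contour in the middle
    visible = [0.2, 0.9, 0.4, 0.1]   # texture spread over the whole signal
    print(dual_stream(infrared, visible))
```

In this sketch the infrared branch only sees a three-sample neighborhood, while every visible-light output depends on the whole input, which mirrors the local-versus-global split the abstract motivates. The additive terms in `fuse` are one simple way to preserve each modality's original features while letting the other modality contribute.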