Abstract: To overcome the limitations of single-modal object detection, this study proposes a dual-modal feature fusion object detection algorithm based on CNN-Transformer. By combining the clear contour information of infrared images with the rich detail information of visible light images, the algorithm integrates the complementary information of the two modalities, which significantly enhances object detection performance and extends its applicability to more complex real-world scenarios. The innovation of the proposed algorithm lies in a dual-stream feature extraction network that processes infrared and visible light images simultaneously. Because infrared images have clear contour information that can guide object localization, we adopt a CNN-based Feature Extraction (CFE) module for infrared images to better capture the location of objects and improve the expressive power of the features. Visible light images, on the other hand, usually contain rich detail information such as color and texture, which is distributed over the whole image, so we adopt a Transformer-based Feature Extraction (TFE) module for visible light images to better capture their global context and detail information. This differentiated feature extraction strategy exploits the advantages of both modalities and enables the algorithm to adapt to object detection tasks in different scenes and conditions. In addition, we introduce a dual-modal feature fusion module, which fuses the feature information of the two modalities through inter-modal information interaction. This module not only preserves the features of the two original modalities but also makes them complement each other, which strengthens the representation of object features and further improves the performance of multimodal object detection. To validate the effectiveness of the algorithm, we conducted extensive experiments on three datasets: the publicly available KAIST and FLIR datasets and the GIR dataset that we created in-house. These datasets contain infrared images, visible light images, and infrared-visible light image pairs. We trained and tested on these images to evaluate the applicability and performance of the algorithm in various situations. The experimental results indicate that, on the KAIST dataset, the detection accuracy of the proposed algorithm on dual-modal images is 5.7% and 17.4% higher than that of the baseline algorithm on infrared and visible light images, respectively. Similarly, on the FLIR dataset, the detection accuracy is 11.6% and 17.1% higher than that of the baseline algorithm on infrared and visible light images, respectively. On our self-constructed GIR dataset, the proposed algorithm also shows a notable improvement in detection accuracy. Additionally, the algorithm can process infrared or visible light images independently, with a significant improvement in detection accuracy over the baseline algorithm for both modalities. This further validates the applicability and robustness of the proposed algorithm. Moreover, during visualization, the detection results can be displayed on either the visible or the infrared image to meet different needs.
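
The abstract does not specify how the dual-modal feature fusion module is built, so the following is only a minimal sketch of one common way to realize "preserve each modality's features while letting them complement each other": residual cross-attention in both directions followed by concatenation. The function names (`cross_attend`, `fuse`), the token representation, and the attention-based design are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """For each query token, return an attention-weighted mix of the
    other modality's tokens (scaled dot-product attention, no learned weights)."""
    dim = len(queries[0])
    scale = math.sqrt(dim)
    mixed = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        mixed.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return mixed


def fuse(ir_tokens, vis_tokens):
    """Hypothetical fusion step (an assumption, not the paper's module).

    ir_tokens / vis_tokens: equal-length lists of feature vectors, standing in
    for the outputs of the CNN branch (infrared) and the Transformer branch
    (visible light). Each stream keeps its own features (residual term) and
    adds information attended from the other stream; the two enhanced streams
    are then concatenated per token.
    """
    ir_from_vis = cross_attend(ir_tokens, vis_tokens)
    vis_from_ir = cross_attend(vis_tokens, ir_tokens)
    ir_enh = [[x + y for x, y in zip(t, m)] for t, m in zip(ir_tokens, ir_from_vis)]
    vis_enh = [[x + y for x, y in zip(t, m)] for t, m in zip(vis_tokens, vis_from_ir)]
    return [a + b for a, b in zip(ir_enh, vis_enh)]  # list concatenation per token


# Toy usage: two tokens per modality, two channels each.
ir = [[1.0, 0.0], [0.0, 1.0]]
vis = [[0.5, 0.5], [0.5, 0.5]]
fused = fuse(ir, vis)
```

In a real detector the queries, keys, and values would pass through learned projections, and the fused features would feed a detection head; the sketch leaves both out to keep the residual-plus-interaction idea visible.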

Bimodal Information Fusion Network for Salient Object Detection based on Transformer

Dual-Branch Feature Fusion Network for Salient Object Detection

Transformers and CNNs Fusion Network for Salient Object Detection

Transformer-Based Cross-Modal Integration Network for RGB-T Salient Object Detection

Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net

Modality-Induced Transfer-Fusion Network for RGB-D and RGB-T Salient Object Detection

Cross-Modality Double Bidirectional Interaction and Fusion Network for RGB-T Salient Object Detection

Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection

Unifying convolution and transformer: a dual stage network equipped with cross-interactive multi-modal feature fusion and edge guidance for RGB-D salient object detection

DFTR: Depth-supervised Fusion Transformer for Salient Object Detection

MFFNet: Multi-modal Feature Fusion Network for V-D-T Salient Object Detection

Salient object detection with dual-branch stepwise feature fusion and edge refinement

Transformer-based Network for RGB-D Saliency Detection

MFCINet: multi-level feature and context information fusion network for RGB-D salient object detection

Object Detection Algorithm Based on CNN-Transformer Dual Modal Feature Fusion

ETFormer: an Efficient Transformer Based on Multimodal Hybrid Fusion and Representation Learning for RGB-D-T Salient Object Detection

TranSal: Depth-guided Transformer for RGB-D Salient Object Detection

Transformer Transforms Salient Object Detection and Camouflaged Object Detection

Fusion-Embedding Siamese Network for Light Field Salient Object Detection

SSTNet: Saliency sparse transformers network with tokenized dilation for salient object detection

EM-Trans: Edge-Aware Multimodal Transformer for RGB-D Salient Object Detection