Abstract: In the field of computer vision, visible light images often exhibit low contrast in low-light conditions, presenting a significant challenge. While infrared imagery provides a potential solution, its utilization entails high costs and practical limitations. Recent advancements in deep learning, particularly the deployment of Generative Adversarial Networks (GANs), have facilitated the transformation of visible light images to infrared images. However, these methods often experience unstable training phases and may produce suboptimal outputs. To address these issues, we propose a novel end-to-end Transformer-based model that efficiently converts visible light images into high-fidelity infrared images. Initially, the Texture Mapping Module and Color Perception Adapter collaborate to extract texture and color features from the visible light image. The Dynamic Fusion Aggregation Module subsequently integrates these features. Finally, the transformation into an infrared image is refined through the synergistic action of the Color Perception Adapter and the Enhanced Perception Attention mechanism. Comprehensive benchmarking experiments confirm that our model outperforms existing methods, producing infrared images of markedly superior quality, both qualitatively and quantitatively. Furthermore, the proposed model enables more effective downstream applications for infrared images than other methods.
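The abstract states that texture and color features are extracted separately and then integrated by the Dynamic Fusion Aggregation Module, but gives no equations for that module. The block below is only a minimal sketch of one common way to fuse two feature maps: a sigmoid gate predicted from both inputs with 1x1 "convolutions" (channel-mixing matrices), followed by a per-position convex blend. The function `dynamic_fusion` and the weights `w_t` and `w_c` are hypothetical illustrations, not IRFormer's actual DFA design.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Logistic function written via tanh, which does not overflow for large |x|."""
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def dynamic_fusion(texture: np.ndarray, color: np.ndarray,
                   w_t: np.ndarray, w_c: np.ndarray) -> np.ndarray:
    """Hypothetical gated fusion of two (C, H, W) feature maps.

    w_t and w_c are (C, C) channel-mixing matrices, i.e. 1x1 convolutions
    without bias. They predict a gate in (0, 1) for every channel and pixel,
    which then blends the texture and color features.
    """
    # A 1x1 convolution is a matrix multiply over the channel axis.
    logits = (np.einsum("oc,chw->ohw", w_t, texture)
              + np.einsum("oc,chw->ohw", w_c, color))
    gate = sigmoid(logits)                          # (C, H, W), values in (0, 1)
    return gate * texture + (1.0 - gate) * color    # convex blend per position


# Usage with random stand-in features (assumed shapes: C=8 channels, 4x4 map).
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
texture = rng.standard_normal((C, H, W))
color = rng.standard_normal((C, H, W))
w_t = 0.1 * rng.standard_normal((C, C))
w_c = 0.1 * rng.standard_normal((C, C))
fused = dynamic_fusion(texture, color, w_t, w_c)
print(fused.shape)  # (8, 4, 4)
```

Because the gate lies in (0, 1), every fused value falls between the corresponding texture and color values. In a trained network, the gate lets the model favor texture or color information locally instead of applying a fixed average.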

What problem does this paper attempt to address?

The paper addresses the low contrast of visible light images captured under low-light conditions. It proposes IRFormer, a new end-to-end Transformer-based model that efficiently converts visible light images into high-fidelity infrared images. The main problems it targets are:

1. **Limitations of visible light images under low-light conditions**: Under low light, visible light images often have low contrast, blurred edges, and missing detail, which complicates applications such as pedestrian detection and autonomous driving.
2. **Cost and limitations of infrared imaging**: Although infrared images can mitigate these limitations, infrared imaging equipment is costly and has practical limitations.
3. **Shortcomings of existing methods**: Methods based on Generative Adversarial Networks (GANs) can convert visible light images to infrared images, but they suffer from unstable training and poor output quality, often producing artifacts and losing information.

To address these issues, the paper makes the following key contributions:

1. **IRFormer model**: A novel end-to-end Transformer-based model that extracts features from visible light images and converts them into high-quality infrared images. It integrates a Color Perception Adapter (CPA), an Enhanced Feature Mapping module (EFM), a Dynamic Fusion Aggregation module (DFA), and an Enhanced Perception Attention mechanism (EPA) for efficient feature extraction and fusion.
2. **Feature extraction and fusion**: The CPA and EFM modules extract RGB information and fine texture details from the visible light image. The DFA module integrates these features into a latent space, achieving a smooth transition between the visible light and infrared domains.
3. **Information loss compensation**: The EPA module compensates for information lost to occlusion or low light through a dual attention mechanism (channel and pixel attention), further enhancing the quality and detail of the infrared images.

Experimental results show that IRFormer outperforms existing visible-to-infrared image translation methods on multiple datasets, both in visual quality and in quantitative metrics such as PSNR and SSIM. The model also shows advantages in downstream tasks such as pedestrian detection, demonstrating the practical value of the infrared images it generates.
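The EPA module is described only as a dual attention mechanism combining channel attention and pixel attention. The sketch below shows that generic pattern under stated assumptions. Channel attention pools each channel globally, passes the result through a small bottleneck MLP, and rescales each channel by a sigmoid weight. Pixel attention uses 1x1 convolutions to predict one sigmoid weight per spatial position. The function names, the bottleneck width `C_r`, and the channel-then-pixel ordering are illustrative assumptions, not the paper's specification.

```python
import numpy as np


def sigmoid(x):
    """Logistic function written via tanh, which does not overflow for large |x|."""
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def relu(x):
    return np.maximum(x, 0.0)


def channel_attention(x, w1, b1, w2, b2):
    """Rescale each channel of x (C, H, W) by a weight computed from global context.

    w1: (C_r, C), b1: (C_r,), w2: (C, C_r), b2: (C,) form a bottleneck MLP.
    """
    pooled = x.mean(axis=(1, 2))            # (C,) global average pooling
    hidden = relu(w1 @ pooled + b1)         # (C_r,)
    gate = sigmoid(w2 @ hidden + b2)        # (C,) one weight per channel
    return x * gate[:, None, None]


def pixel_attention(x, w1, b1, w2, b2):
    """Rescale each spatial position of x (C, H, W) by a predicted weight.

    w1: (C_r, C), b1: (C_r,), w2: (1, C_r), b2: (1,) act as 1x1 convolutions.
    """
    hidden = relu(np.einsum("rc,chw->rhw", w1, x) + b1[:, None, None])
    gate = sigmoid(np.einsum("or,rhw->ohw", w2, hidden) + b2[:, None, None])
    return x * gate                         # (1, H, W) gate broadcasts over channels


def dual_attention(x, ca_params, pa_params):
    """Hypothetical dual attention: channel attention, then pixel attention."""
    return pixel_attention(channel_attention(x, *ca_params), *pa_params)


# Usage with random stand-in features and parameters (assumed sizes).
rng = np.random.default_rng(0)
C, C_r, H, W = 16, 4, 8, 8
x = rng.standard_normal((C, H, W))
ca = (0.1 * rng.standard_normal((C_r, C)), np.zeros(C_r),
      0.1 * rng.standard_normal((C, C_r)), np.zeros(C))
pa = (0.1 * rng.standard_normal((C_r, C)), np.zeros(C_r),
      0.1 * rng.standard_normal((1, C_r)), np.zeros(1))
out = dual_attention(x, ca, pa)
print(out.shape)  # (16, 8, 8)
```

Both gates lie in (0, 1), so this pattern can only reweight features; it cannot create new information. Any "compensation" for occluded or dark regions would come from learned weights that emphasize informative channels and positions.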
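The reported comparison relies on PSNR and SSIM. The sketch below implements the standard PSNR formula, $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, together with a deliberately simplified SSIM computed over the whole image as a single window, using the usual constants $C_1 = (0.01\,\mathrm{MAX})^2$ and $C_2 = (0.03\,\mathrm{MAX})^2$. Standard SSIM averages the same expression over local (typically Gaussian-weighted) windows, so `global_ssim` will not reproduce published numbers. The 8-bit data range of 255 is also an assumption; the summary does not state the paper's evaluation settings.

```python
import numpy as np


def psnr(reference, test, data_range=255.0):
    """Peak signal-to-noise ratio in dB; returns inf for identical images."""
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))


def global_ssim(reference, test, data_range=255.0):
    """Simplified SSIM over the whole image as one window (not windowed SSIM)."""
    x = np.asarray(reference, dtype=np.float64)
    y = np.asarray(test, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)


# Usage on a synthetic single-channel "infrared" image (integer-valued, 0..199).
rng = np.random.default_rng(0)
ref = rng.integers(0, 200, size=(32, 32)).astype(np.float64)
shifted = ref + 10.0                              # MSE is exactly 100
noisy = ref + rng.normal(0.0, 20.0, ref.shape)
print(psnr(ref, shifted), global_ssim(ref, noisy))
```

A uniform brightness shift lowers PSNR but leaves the structure term of SSIM at 1, which is one reason papers usually report both metrics.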