Abstract: Infrared imaging offers resilience against changing lighting conditions by capturing object temperatures. Yet, in a few scenarios, its lack of visual detail compared to daytime visible images poses a significant challenge for human and machine interpretation. This paper proposes a novel diffusion method, dubbed Temporally Consistent Patch Diffusion Models (TC-PDM), for infrared-to-visible video translation. Our method, extending the Patch Diffusion Model, consists of two key components. Firstly, we propose semantic-guided denoising, leveraging the strong representations of foundational models. As such, our method faithfully preserves the semantic structure of generated visible images. Secondly, we propose a novel temporal blending module to guide the denoising trajectory, ensuring temporal consistency between consecutive frames. Experiments show that TC-PDM outperforms state-of-the-art methods by 35.3% in FVD for infrared-to-visible video translation and by 6.1% in AP50 for day-to-night object detection. Our code is publicly available at https://github.com/dzungdoan6/tc-pdm

What problem does this paper attempt to address?

This paper aims to solve two key challenges in infrared-to-visible (I2V) translation:

1. **Preservation of semantic structure**: Although infrared imaging is robust in harsh environments, infrared images carry far less detail than visible-light images. Existing patch diffusion models are prone to distorting the semantics of small objects when generating visible-light images, especially in complex scenes. This distortion degrades the performance of downstream tasks such as object detection.
2. **Temporal consistency**: Applying existing image-level I2V translation methods directly to video data produces unsmooth transitions between frames, so the generated videos look unrealistic. Ensuring temporal consistency between consecutive frames is therefore crucial.

To solve these problems, the authors propose a new diffusion model, **Temporally Consistent Patch Diffusion Models (TC-PDM)**. The model addresses the above challenges through two innovations:

1. **Semantic-guided denoising**: A pre-trained foundation segmentation model extracts a semantic segmentation map from the infrared image, and this map is used as a conditional input to the diffusion model. This helps maintain the semantic structure of the scene when generating visible-light images.
2. **Temporal blending module**: Optical flow is used to estimate the correspondence between adjacent frames and to guide the denoising trajectory, thereby ensuring temporal consistency between the generated consecutive frames.

The experimental results show that TC-PDM significantly outperforms existing methods in I2V video translation and night-time object detection. Specifically, TC-PDM is 35.3% better than existing methods on the FVD (Fréchet Video Distance) metric and 6.1% better on the AP50 metric (average precision at an IoU threshold of 0.5).
In summary, the main contribution of this paper is a new diffusion model framework that, by introducing semantic conditioning and a temporal consistency mechanism, significantly improves the quality of I2V video translation and the performance of downstream tasks.
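The two mechanisms summarized above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`build_condition`, `warp_with_flow`, `temporal_blend`), the channel-concatenation form of the conditioning, the nearest-neighbour warp, and the fixed blending weight `alpha` are all assumptions made for illustration. The segmentation labels and optical flow are taken as given inputs, since the summary does not say which models produce them.

```python
import numpy as np


def build_condition(infrared, seg_labels, num_classes):
    """Stack an infrared frame with a one-hot semantic map (assumed conditioning form).

    infrared:   (H, W) float array
    seg_labels: (H, W) integer class ids from some segmentation model (hypothetical input)
    returns:    (1 + num_classes, H, W) array
    """
    one_hot = np.eye(num_classes, dtype=infrared.dtype)[seg_labels]  # (H, W, C)
    one_hot = np.moveaxis(one_hot, -1, 0)                            # (C, H, W)
    return np.concatenate([infrared[None], one_hot], axis=0)


def warp_with_flow(prev, flow):
    """Backward-warp `prev` (H, W) into the current frame, nearest-neighbour lookup.

    flow[y, x] = (dx, dy) points from a current-frame pixel to its match in `prev`.
    Returns the warped image and a mask of pixels whose match lies inside the image.
    """
    h, w = prev.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.zeros_like(prev)
    warped[valid] = prev[src_y[valid], src_x[valid]]
    return warped, valid


def temporal_blend(curr_est, prev_est, flow, alpha=0.5):
    """Pull the current frame's estimate toward the flow-warped previous estimate.

    Pixels without a valid correspondence keep the current estimate unchanged.
    `alpha` is an illustrative blending weight, not a value from the paper.
    """
    warped, valid = warp_with_flow(prev_est, flow)
    blended = curr_est.copy()
    blended[valid] = (1 - alpha) * curr_est[valid] + alpha * warped[valid]
    return blended
```

In a diffusion sampler, a blend of this kind would be applied to intermediate estimates at each denoising step, so that consecutive frames are nudged toward agreement along the estimated motion. How and where TC-PDM applies its blending is not specified in the summary above.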

TC-PDM: Temporally Consistent Patch Diffusion Models for Infrared-to-Visible Video Translation
