TC-PDM: Temporally Consistent Patch Diffusion Models for Infrared-to-Visible Video Translation

Anh-Dzung Doan, Vu Minh Hieu Phan, Surabhi Gupta, Markus Wagner, Tat-Jun Chin, Ian Reid
2024-08-26
Abstract: Infrared imaging offers resilience against changing lighting conditions by capturing object temperatures. Yet, in a few scenarios, its lack of visual detail compared to daytime visible images poses a significant challenge for human and machine interpretation. This paper proposes a novel diffusion method, dubbed Temporally Consistent Patch Diffusion Models (TC-PDM), for infrared-to-visible video translation. Our method, which extends the Patch Diffusion Model, consists of two key components. First, we propose semantic-guided denoising, which leverages the strong representations of foundation models so that the generated visible images faithfully preserve the semantic structure of the scene. Second, we propose a novel temporal blending module that guides the denoising trajectory to ensure temporal consistency between consecutive frames. Experiments show that TC-PDM outperforms state-of-the-art methods by 35.3% in FVD for infrared-to-visible video translation and by 6.1% in AP50 for day-to-night object detection. Our code is publicly available at https://github.com/dzungdoan6/tc-pdm
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two key challenges in infrared-to-visible (I2V) video translation:

1. **Preservation of semantic structure**: Infrared images are robust in harsh environments, but they carry far less detail than visible-light images. Existing patch diffusion models tend to distort the semantics of small objects when generating visible-light images, especially in complex scenes. This distortion degrades downstream tasks such as object detection.
2. **Temporal consistency**: Applying existing image-level I2V translation methods directly to video, frame by frame, produces unsmooth transitions between frames, so the generated videos look unrealistic. Ensuring temporal consistency between consecutive frames is therefore crucial.

To address these problems, the authors propose a new diffusion model, **Temporally Consistent Patch Diffusion Models (TC-PDM)**, built on two innovations:

1. **Semantic-guided denoising**: A pre-trained foundation segmentation model extracts a semantic segmentation map from the infrared image, and this map is used as a conditional input to the diffusion model. This helps preserve the semantic structure of the scene in the generated visible-light images.
2. **Temporal blending module**: Optical flow estimates the correspondence between neighboring frames and is used to guide the denoising trajectory, which keeps the generated consecutive frames temporally consistent.

The experiments show that TC-PDM outperforms state-of-the-art methods by 35.3% in FVD (Fréchet Video Distance) for I2V video translation and by 6.1% in AP50 (average precision at an IoU threshold of 0.5) for day-to-night object detection.
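The two components can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see their repository for that), and the summary does not specify how the conditioning or the blending is actually realized. The sketch assumes that the semantic map is injected by channel-wise concatenation with the noisy visible patch and the infrared patch, and that temporal blending is a linear mix between the current estimate and the previous frame's estimate warped by a dense optical flow field. All function names, the nearest-neighbor warping, and the `weight` parameter are hypothetical.

```python
import numpy as np


def one_hot_segmentation(seg_labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Turn an (H, W) integer label map into a (num_classes, H, W) one-hot map."""
    classes = np.arange(num_classes)[:, None, None]
    return (classes == seg_labels[None]).astype(np.float32)


def build_denoiser_input(noisy_patch, infrared_patch, seg_patch, num_classes):
    """Assumed conditioning scheme: stack the noisy visible patch (3, H, W),
    the infrared patch (1, H, W) and the one-hot semantic map along channels."""
    seg = one_hot_segmentation(seg_patch, num_classes)
    return np.concatenate([noisy_patch, infrared_patch, seg], axis=0)


def warp_with_flow(prev_frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp prev_frame (C, H, W) into the current frame.

    flow has shape (2, H, W): flow[0] is the x offset and flow[1] the y offset
    from each current-frame pixel to its source pixel in the previous frame.
    Nearest-neighbor sampling; out-of-range coordinates are clamped to the border.
    """
    _, h, w = prev_frame.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return prev_frame[:, src_y, src_x]


def temporal_blend(current_estimate, prev_estimate, flow, weight=0.5):
    """Assumed blending rule: pull the current frame's estimate toward the
    flow-warped estimate of the previous frame with a fixed mixing weight."""
    warped = warp_with_flow(prev_estimate, flow)
    return (1.0 - weight) * current_estimate + weight * warped
```

In a sampling loop, `build_denoiser_input` would be called once per patch at every denoising step, and `temporal_blend` would be applied to the intermediate estimate of frame t using the estimate of frame t-1. A larger `weight` favors smoothness across frames at the cost of fidelity to the current frame's own prediction.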
In summary, the main contribution of this paper is a new diffusion model framework that, by introducing semantic conditioning and a temporal consistency mechanism, improves both the quality of I2V video translation and the performance of downstream tasks.