Zero-shot Video Restoration and Enhancement Using Pre-Trained Image Diffusion Model

Cong Cao, Huanjing Yue, Xin Liu, Jingyu Yang
2024-07-02
Abstract: Diffusion-based zero-shot image restoration and enhancement models have achieved great success in various image restoration and enhancement tasks without training. However, directly applying them to video restoration and enhancement results in severe temporal flickering artifacts. In this paper, we propose the first framework for zero-shot video restoration and enhancement based on a pre-trained image diffusion model. By replacing the self-attention layer with the proposed cross-previous-frame attention layer, the pre-trained image diffusion model can take advantage of the temporal correlation between neighboring frames. We further propose temporal consistency guidance, spatial-temporal noise sharing, and an early stopping sampling strategy for better temporally consistent sampling. Our method is a plug-and-play module that can be inserted into any diffusion-based zero-shot image restoration or enhancement methods to further improve their performance. Experimental results demonstrate the superiority of our proposed method in producing temporally consistent videos with better fidelity.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses zero-shot video restoration and enhancement. The authors propose a framework built on a pre-trained image diffusion model that requires no additional training. Existing diffusion-based zero-shot image restoration and enhancement methods produce severe temporal flickering artifacts when applied directly to videos. To solve this problem, the authors propose four key techniques:

1. **Cross-Previous-Frame Attention**: Replaces the self-attention layer so that the model can use information from the previous frame, exploiting the temporal correlation between neighboring frames.
2. **Temporal Consistency Guidance**: Guides the sampling process using optical flow and occlusion masks to maintain temporal consistency.
3. **Spatial-Temporal Noise Sharing**: Shares noise between different frames to reduce temporal flickering.
4. **Early Stopping Sampling Strategy**: Stops the reverse diffusion process early to avoid generating high-frequency noise in the later stages.

Together, these techniques let the pre-trained image diffusion model keep its output temporally consistent across frames. The method is a plug-and-play module that can be inserted into existing diffusion-based zero-shot image restoration or enhancement methods. Experimental results show that it produces temporally consistent videos with better fidelity in tasks such as low-light video enhancement, video super-resolution, video restoration, and video colorization.
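
To make the cross-previous-frame attention idea more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that queries come from the current frame while keys and values come from the previous frame, which is one common way to build cross-frame attention and is not spelled out in the summary above. The learned query/key/value projections of a real diffusion U-Net are omitted (treated as identity), the fallback to ordinary self-attention for the first frame is also an assumption, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    output = []
    for q in queries:
        # Similarity of this query token to every key token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value tokens.
        output.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return output


def self_attention(tokens):
    """Ordinary self-attention: a frame attends to its own tokens."""
    return attention(tokens, tokens, tokens)


def cross_previous_frame_attention(cur_tokens, prev_tokens):
    """Assumed variant: queries from the current frame, keys and values
    from the previous frame. Falls back to self-attention when there is
    no previous frame (e.g. the first frame of the video)."""
    if prev_tokens is None:
        return self_attention(cur_tokens)
    return attention(cur_tokens, prev_tokens, prev_tokens)


if __name__ == "__main__":
    # Toy "frames": two tokens each, with two feature channels per token.
    frames = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.9, 0.1], [0.2, 0.8]],
    ]
    prev = None
    for index, frame in enumerate(frames):
        out = cross_previous_frame_attention(frame, prev)
        print(index, out)
        prev = frame
```

Because each output token is a weighted average of previous-frame tokens, the features of the current frame are pulled toward those of its predecessor, which is the intuition behind using this layer to reduce flicker. A real implementation would operate on batched tensors inside the attention layers of the pre-trained model rather than on Python lists.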