Exploiting Diffusion Prior for Real-World Image Super-Resolution

Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C.K. Chan, Chen Change Loy
2024-06-29
Abstract: We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution (SR). Specifically, by employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model, thereby preserving the generative prior and minimizing training cost. To remedy the loss of fidelity caused by the inherent stochasticity of diffusion models, we employ a controllable feature wrapping module that allows users to balance quality and fidelity by simply adjusting a scalar value during the inference process. Moreover, we develop a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models, enabling adaptation to resolutions of any size. A comprehensive evaluation of our method using both synthetic and real-world benchmarks demonstrates its superiority over current state-of-the-art approaches. Code and models are available at https://github.com/IceClear/StableSR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to use a pre-trained diffusion model to generate high-quality, high-resolution images in the image super-resolution (SR) task while preserving the fidelity and details of the input image. Specifically, the authors propose StableSR, a method that achieves blind super-resolution by leveraging the prior knowledge in pre-trained text-to-image diffusion models. The main challenges are overcoming the inherent randomness of diffusion models and handling input images of any resolution while keeping the generated content faithful to the input.

### Main Contributions:

1. **Time-aware Encoder**: To achieve high-quality restoration without changing the pre-trained synthesis model, the authors design a time-aware encoder that adaptively adjusts features at different diffusion steps. It provides stronger guidance in the early iterations to maintain fidelity and weaker guidance in the later iterations to avoid introducing degradation.
2. **Controllable Feature Wrapping Module (CFW)**: To address the fidelity loss caused by the intrinsic randomness of diffusion models, the authors introduce a controllable feature wrapping module that lets users balance quality and fidelity by adjusting a scalar value.
3. **Progressive Aggregation Sampling Strategy**: To overcome the fixed-size limitation of pre-trained diffusion models, the authors develop a progressive aggregation sampling strategy that lets the model adapt to resolutions of any size. It divides the image into overlapping tiles and fuses them at each diffusion iteration to smooth the boundaries, producing a more coherent output.

### Experimental Results:

The authors verify the effectiveness of StableSR through a series of experiments, including quantitative comparisons on synthetic and real-world datasets.
The experimental results show that StableSR outperforms existing state-of-the-art methods on multiple metrics, most notably FID (Fréchet Inception Distance) and CLIP-IQA (CLIP-based Image Quality Assessment).

### Formula Examples:

- **Feature Modulation**:
  \[ \hat{F}_n^{\text{dif}} = (1 + \alpha_n) \odot F_n^{\text{dif}} + \beta_n; \quad \alpha_n, \beta_n = M_\theta(F_n) \]
  where $\alpha_n$ and $\beta_n$ are the affine parameters in SFT, and $M_\theta$ is a small network containing several convolutional layers.
- **Color Correction**:
  \[ y_c = \frac{\hat{y}_c - \mu_c^{\hat{y}}}{\sigma_c^{\hat{y}}} \cdot \sigma_c^x + \mu_c^x \]
  where $c \in \{r, g, b\}$ denotes the color channel, $\mu_c^{\hat{y}}$ and $\sigma_c^{\hat{y}}$ are the mean and standard deviation estimated from the $c$-th channel of the generated high-resolution image $\hat{y}$, and $\mu_c^x$ and $\sigma_c^x$ are the mean and standard deviation estimated from the $c$-th channel of the low-resolution input $x$.

### Conclusion:

StableSR leverages the generative prior in pre-trained diffusion models to address both the fidelity and the arbitrary-resolution challenges of image super-resolution.
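The color correction described in the formula examples amounts to per-channel mean and standard deviation matching: the generated image is normalized by its own channel statistics, then rescaled to those of the low-resolution input. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The function name `color_correct`, the `(H, W, 3)` array layout, and the small `eps` guard against division by zero are assumptions made for illustration.

```python
import numpy as np


def color_correct(y_hat: np.ndarray, x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Match the per-channel mean/std of y_hat to those of x.

    y_hat: generated high-resolution image, shape (H, W, 3)  (assumed layout)
    x:     low-resolution input image, shape (h, w, 3)       (assumed layout)
    eps:   hypothetical guard against a zero standard deviation
    """
    # Channel statistics of the generated image (mu^{y_hat}, sigma^{y_hat}).
    mu_y = y_hat.mean(axis=(0, 1), keepdims=True)
    sigma_y = y_hat.std(axis=(0, 1), keepdims=True)
    # Channel statistics of the low-resolution input (mu^x, sigma^x).
    mu_x = x.mean(axis=(0, 1), keepdims=True)
    sigma_x = x.std(axis=(0, 1), keepdims=True)
    # Normalize y_hat per channel, then rescale to the input's statistics.
    return (y_hat - mu_y) / (sigma_y + eps) * sigma_x + mu_x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_hat = rng.random((8, 8, 3))          # stand-in for a generated image
    x = rng.random((4, 4, 3)) * 0.5 + 0.2  # stand-in for the LR input
    y = color_correct(y_hat, x)
    print(y.mean(axis=(0, 1)), x.mean(axis=(0, 1)))
```

Because only channel-wise statistics are used, the two images do not need the same spatial size, which fits the case where $x$ is the low-resolution input.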
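The progressive aggregation sampling strategy listed among the contributions can be illustrated with a small tiling sketch: split an array into overlapping tiles, process each tile independently, and blend the results with a weight map so that tile boundaries are smoothed. The code below is a sketch under stated assumptions rather than the StableSR code. It works on a single 2D array instead of a latent tensor, the Gaussian weight map is an assumed blending choice, and `aggregate_step`, `step_fn`, `tile`, `stride`, and `sigma_frac` are hypothetical names and parameters.

```python
import numpy as np


def gaussian_weight(tile: int, sigma_frac: float = 0.3) -> np.ndarray:
    """2D Gaussian weight map peaking at the tile centre (assumed blending choice)."""
    coords = np.arange(tile) - (tile - 1) / 2.0
    g = np.exp(-0.5 * (coords / (sigma_frac * tile)) ** 2)
    return np.outer(g, g)


def tile_starts(size: int, tile: int, stride: int) -> list:
    """Start offsets covering [0, size), with the last tile flush to the edge."""
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] != size - tile:
        starts.append(size - tile)
    return starts


def aggregate_step(latent: np.ndarray, step_fn, tile: int = 64, stride: int = 32) -> np.ndarray:
    """Apply step_fn to overlapping tiles and fuse the results by weighted averaging.

    step_fn stands in for one denoising step of a fixed-size model (hypothetical).
    """
    h, w = latent.shape
    assert h >= tile and w >= tile, "input must be at least one tile in size"
    weight = gaussian_weight(tile)
    acc = np.zeros(latent.shape, dtype=np.float64)   # weighted sum of tile outputs
    norm = np.zeros(latent.shape, dtype=np.float64)  # sum of weights per pixel
    for top in tile_starts(h, tile, stride):
        for left in tile_starts(w, tile, stride):
            patch = latent[top:top + tile, left:left + tile]
            acc[top:top + tile, left:left + tile] += weight * step_fn(patch)
            norm[top:top + tile, left:left + tile] += weight
    # Every pixel is covered by at least one tile, so norm is strictly positive.
    return acc / norm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.random((100, 90))
    out = aggregate_step(latent, lambda p: 0.5 * p)
    print(out.shape)
```

In a sampling loop, a function like `aggregate_step` would be called once per diffusion iteration, so the overlapping regions are re-fused at every step instead of only once at the end.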