Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model

Chen Rao, Guangyuan Li, Zehua Lan, Jiakai Sun, Junsheng Luan, Wei Xing, Lei Zhao, Huaizhong Lin, Jianfeng Dong, Dalong Zhang
2024-08-24
Abstract: Current video deblurring methods have limitations in recovering high-frequency information since the regression losses are conservative with high-frequency details. Since Diffusion Models (DMs) have strong capabilities in generating high-frequency details, we consider introducing DMs into the video deblurring task. However, we found that directly applying DMs to the video deblurring task has the following problems: (1) DMs require many iteration steps to generate videos from Gaussian noise, which consumes many computational resources. (2) DMs are easily misled by the blurry artifacts in the video, resulting in irrational content and distortion of the deblurred video. To address the above issues, we propose a novel video deblurring framework VD-Diff that integrates the diffusion model into the Wavelet-Aware Dynamic Transformer (WADT). Specifically, we perform the diffusion model in a highly compact latent space to generate prior features containing high-frequency information that conforms to the ground truth distribution. We design the WADT to preserve and recover the low-frequency information in the video while utilizing the high-frequency information generated by the diffusion model. Extensive experiments show that our proposed VD-Diff outperforms SOTA methods on GoPro, DVD, BSD, and Real-World Video datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the restoration of high-frequency information in video deblurring. Current video deblurring methods struggle to restore high-frequency information because regression losses are conservative with high-frequency details. Diffusion Models (DMs) show strong capabilities in generating high-frequency details, so the authors consider introducing them into the video deblurring task. However, directly applying a diffusion model to video deblurring raises two main problems:

1. **High consumption of computational resources**: The diffusion model requires many iterative steps to generate videos from Gaussian noise, which consumes a large amount of computational resources.
2. **Easily misled by blurring artifacts**: The diffusion model is easily misled by blur artifacts in the video, resulting in unreasonable or distorted content in the deblurred video.

To solve these problems, the authors propose a new video deblurring framework, VD-Diff, which integrates the diffusion model into the Wavelet-Aware Dynamic Transformer (WADT). Specifically, they run the diffusion model in a highly compact latent space to generate prior features that conform to the ground-truth distribution and contain high-frequency information. WADT, in turn, is designed to preserve and restore the low-frequency information in the video while using the high-frequency information generated by the diffusion model. The reported experiments show that VD-Diff outperforms existing state-of-the-art methods on the GoPro, DVD, BSD, and Real-World Video datasets.
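The low-frequency/high-frequency split that WADT relies on can be illustrated with a wavelet decomposition. The following is a minimal sketch, assuming a single-level 2D Haar transform applied to one grayscale frame; the summary does not state which wavelet the paper uses, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def haar_dwt2(frame):
    """Single-level 2D Haar transform of a float (H, W) array with even H and W.

    Returns the low-frequency band and a tuple of three high-frequency bands.
    This is an illustrative assumption, not the paper's stated wavelet.
    """
    a = frame[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = frame[0::2, 1::2]  # top-right
    c = frame[1::2, 0::2]  # bottom-left
    d = frame[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0          # scaled local average (low frequency)
    diff_cols = (a - b + c - d) / 2.0    # left column minus right column
    diff_rows = (a + b - c - d) / 2.0    # top row minus bottom row
    diff_diag = (a - b - c + d) / 2.0    # diagonal difference
    return low, (diff_cols, diff_rows, diff_diag)


def haar_idwt2(low, highs):
    """Inverse of haar_dwt2: rebuild the (2H, 2W) frame from its four bands."""
    diff_cols, diff_rows, diff_diag = highs
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=low.dtype)
    out[0::2, 0::2] = (low + diff_cols + diff_rows + diff_diag) / 2.0
    out[0::2, 1::2] = (low - diff_cols + diff_rows - diff_diag) / 2.0
    out[1::2, 0::2] = (low + diff_cols - diff_rows - diff_diag) / 2.0
    out[1::2, 1::2] = (low - diff_cols - diff_rows + diff_diag) / 2.0
    return out
```

In this sketch a perfectly flat frame yields all-zero high-frequency bands, and the inverse transform reproduces the input exactly. That is why a wavelet split is a convenient way to handle the low-frequency content separately from the details that blur removes.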
### Formula Summary

- **Formula 1**: Modulation of the features by the highly compact latent prior feature \(z'\), which guides video restoration:
  \[ \hat{F}=W_{1}^{l}z'\odot\text{LN}(F_{aa}) + W_{2}^{l}z' \]
  where \(\odot\) represents element-wise multiplication, \(\text{LN}\) represents layer normalization, and \(W_{1}^{l}\) and \(W_{2}^{l}\) represent linear layers.
- **Formula 2**: Attention calculation:
  \[ F'_{\text{out}}=W_{a}\hat{V}\cdot\text{Softmax}\left(\frac{\hat{K}\cdot\hat{Q}}{\gamma}\right)+F_{aa} \]
  where \(\gamma\) is a learnable scaling parameter.
- **Formula 3**: Overall process of the WDA-FFN part:
  \[ F_{\text{out}}=G(W_{1}^{2}W_{1}^{3}\hat{F}'_{\text{out}})\odot W_{2}^{2}W_{2}^{3}\hat{F}'_{\text{out}}+\hat{F}'_{\text{out}} \]
  where \(G\) represents the Gaussian Error Linear Unit (GELU).
- **Formula 4**: Gaussian transition of the forward diffusion process:
  \[ q(z_{t}|z_{t - 1})=\mathcal{N}(z_{t};\sqrt{1-\beta_{t}}z_{t - 1},\beta_{t}I) \]
- **Formula 5**: Update step of the reverse denoising process:
  \[ z_{t - 1}=\frac{1}{\sqrt{\alpha_{t}}}\left(z_{t}-\epsilon\sqrt{1-\bar{\alpha}_{t}}\right) \]
  where, in the usual diffusion-model notation, \(\alpha_{t}=1-\beta_{t}\), \(\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}\), and \(\epsilon\) denotes the noise term.

With these components, VD-Diff targets the restoration of high-frequency information in video deblurring, and the paper reports improved deblurred-video quality over prior methods.
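Formulas 1 and 4 can be made concrete with a short NumPy sketch. It rests on several assumptions not given in the summary: the noise schedule is linear with the endpoint values shown, layer normalization has no learnable affine parameters, and the linear layers are plain weight matrices without bias. The names `make_schedule`, `forward_step`, `layer_norm`, and `modulate` are hypothetical.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumed choice) with alpha_t and cumulative alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas              # alpha_t = 1 - beta_t
    alpha_bars = np.cumprod(alphas)   # alpha_bar_t = product of alpha_1..alpha_t
    return betas, alphas, alpha_bars


def forward_step(z_prev, beta_t, rng):
    """One forward diffusion step (Formula 4): sample z_t from q(z_t | z_{t-1})."""
    noise = rng.standard_normal(z_prev.shape)
    return np.sqrt(1.0 - beta_t) * z_prev + np.sqrt(beta_t) * noise


def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis, without learnable scale or shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def modulate(features, z_prior, w1, w2):
    """Prior-guided modulation (Formula 1): W1 z' * LN(F) + W2 z'.

    features: (N, C) token features, z_prior: (D,) latent prior,
    w1, w2: (D, C) weight matrices standing in for the two linear layers.
    """
    scale = z_prior @ w1   # per-channel scale derived from the prior
    shift = z_prior @ w2   # per-channel shift derived from the prior
    return scale * layer_norm(features) + shift
```

A typical call would build a schedule with `make_schedule(T)`, noise a latent step by step with `forward_step(z, betas[t], np.random.default_rng(0))`, and apply `modulate` to inject the prior into the transformer features. The compact prior `z_prior` is what keeps the diffusion part cheap, since only a small vector is diffused rather than full video frames.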