SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution

Qi Tang, Yao Zhao, Meiqin Liu, Chao Yao
2024-10-26
Abstract: Diffusion-based Video Super-Resolution (VSR) is renowned for generating perceptually realistic videos, yet it grapples with maintaining detail consistency across frames due to stochastic fluctuations. The traditional approach of pixel-level alignment is ineffective for diffusion-processed frames because of iterative disruptions. To overcome this, we introduce SeeClear--a novel VSR framework leveraging conditional video generation, orchestrated by instance-centric and channel-wise semantic controls. This framework integrates a Semantic Distiller and a Pixel Condenser, which synergize to extract and upscale semantic details from low-resolution frames. The Instance-Centric Alignment Module (InCAM) utilizes video-clip-wise tokens to dynamically relate pixels within and across frames, enhancing coherency. Additionally, the Channel-wise Texture Aggregation Memory (CaTeGory) infuses extrinsic knowledge, capitalizing on long-standing semantic textures. Our method also innovates the blurring diffusion process with the ResShift mechanism, finely balancing between sharpness and diffusion effects. Comprehensive experiments confirm our framework's advantage over state-of-the-art diffusion-based VSR techniques. The code is available at https://github.com/Tang1705/SeeClear-NeurIPS24.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve two main problems in video super-resolution (VSR):

1. **Detail Consistency Preservation**: Diffusion-based VSR methods generate perceptually realistic videos, but stochastic fluctuations make it difficult to keep details consistent across frames.
2. **Ineffectiveness of Pixel-level Alignment**: Traditional pixel-level alignment is ineffective for diffusion-processed frames, because the iterative process disrupts them and the alignment becomes inaccurate.

To solve these problems, the paper proposes SeeClear, a novel VSR framework based on conditional video generation that enhances pixel condensation through instance-centric and channel-wise semantic controls. Specifically, SeeClear introduces the following innovations:

- **Semantic Distiller and Pixel Condenser**: These two modules work together to extract and upscale semantic details from low-resolution frames.
- **Instance-Centric Alignment Module (InCAM)**: Uses video-clip-wise tokens to dynamically relate pixels within and across frames, improving coherence.
- **Channel-wise Texture Aggregation Memory (CaTeGory)**: Introduces extrinsic knowledge and draws on long-standing semantic textures to enhance global temporal consistency.
- **ResShift Mechanism in the Blurring Diffusion Process**: Introduces a residual shift into the diffusion process to balance sharpness and diffusion effects.

Through these innovations, SeeClear improves video super-resolution quality while maintaining detail consistency between frames. Experimental results show that SeeClear outperforms existing diffusion-based VSR techniques on multiple metrics.

### Formula Summary

1. **Forward Diffusion Process**:
   \[ q(u_t|u_0)=\mathcal{N}(u_t|D_tu_0,\eta_tE),\quad t\in\{1,\dots,T\} \]
   where \(u_0 = V^TI_{HR}^i\), \(V^T\) represents the DCT projection matrix, \(D_t = e^{\Lambda_t}\) is a diagonal blurring matrix, \(\eta_t\) is the noise variance, and \(E\) is the identity matrix.
2. **Forward Diffusion Process with Residual Shift**:
   \[ q(u_t|u_0,u_l)=\mathcal{N}(u_t|D_tu_0+\eta_te_t,\kappa^2\eta_tE),\quad t\in\{1,\dots,T\} \]
   where \(e_t = u_l - D_tu_0\), \(u_l\) is the frequency-domain representation of the low-resolution frame, \(\eta_t\) is the shift sequence, and \(\kappa\) is a hyperparameter controlling the noise intensity.
3. **Reverse Sampling Process**:
   \[ p(u_0|u_l)=\int p(u_T|u_l)\prod_{t = 1}^T p_\theta(u_{t-1}|u_t,u_l)du_{1:T} \]
   where \(p(u_T|u_l)\approx\mathcal{N}(u_T|u_l,\kappa^2E)\), and \(p_\theta(u_{t-1}|u_t,u_l)\) is the reverse transition kernel that recovers \(u_{t-1}\) from \(u_t\).
4. **Discrete Wavelet Transform (DWT)**: \[ I_{HR}^{ll},I_{HR}^{lh},I_{HR}^{
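
As a rough illustration of the forward process with residual shift (item 2 above), the sketch below computes the mean and standard deviation of \(q(u_t|u_0,u_l)\) and draws one sample. It is not the authors' implementation. It assumes the DCT coefficients are given as a flat list of floats and that the diagonal of \(\Lambda_t\) is passed as per-frequency exponents; the function names `forward_mean_std` and `forward_sample` and all numeric values are hypothetical.

```python
import math
import random


def forward_mean_std(u0, ul, lambdas_t, eta_t, kappa):
    """Mean and std of q(u_t | u_0, u_l) for a diagonal D_t = exp(Lambda_t).

    u0, ul     : DCT coefficients of the HR and LR frames (flat lists, assumed)
    lambdas_t  : diagonal entries of Lambda_t, one per frequency (assumed)
    eta_t      : shift value at step t, expected to lie in [0, 1]
    kappa      : noise-intensity hyperparameter
    """
    means = []
    for x0, xl, lam in zip(u0, ul, lambdas_t):
        blurred = math.exp(lam) * x0             # D_t u_0
        residual = xl - blurred                  # e_t = u_l - D_t u_0
        means.append(blurred + eta_t * residual)  # D_t u_0 + eta_t e_t
    std = kappa * math.sqrt(eta_t)               # sqrt(kappa^2 * eta_t)
    return means, std


def forward_sample(u0, ul, lambdas_t, eta_t, kappa, rng=random):
    """Draw one u_t from the Gaussian defined by forward_mean_std."""
    means, std = forward_mean_std(u0, ul, lambdas_t, eta_t, kappa)
    return [rng.gauss(m, std) for m in means]


# Hypothetical toy values: three frequencies, higher ones damped more strongly.
u0 = [4.0, 2.0, 1.0]
ul = [3.5, 0.5, 0.0]
lambdas = [0.0, -0.5, -2.0]

means_T, std_T = forward_mean_std(u0, ul, lambdas, eta_t=1.0, kappa=2.0)
print(means_T, std_T)
print(forward_sample(u0, ul, lambdas, eta_t=0.5, kappa=2.0, rng=random.Random(0)))
```

With \(\eta_t = 1\) the mean collapses to \(u_l\) and the variance to \(\kappa^2\), which matches the prior \(p(u_T|u_l)\approx\mathcal{N}(u_T|u_l,\kappa^2E)\) used to start reverse sampling in item 3.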