Abstract: Diffusion-based Video Super-Resolution (VSR) is renowned for generating perceptually realistic videos, yet it grapples with maintaining detail consistency across frames due to stochastic fluctuations. The traditional approach of pixel-level alignment is ineffective for diffusion-processed frames because of iterative disruptions. To overcome this, we introduce SeeClear, a novel VSR framework leveraging conditional video generation, orchestrated by instance-centric and channel-wise semantic controls. This framework integrates a Semantic Distiller and a Pixel Condenser, which synergize to extract and upscale semantic details from low-resolution frames. The Instance-Centric Alignment Module (InCAM) utilizes video-clip-wise tokens to dynamically relate pixels within and across frames, enhancing coherency. Additionally, the Channel-wise Texture Aggregation Memory (CaTeGory) infuses extrinsic knowledge, capitalizing on long-standing semantic textures. Our method also innovates the blurring diffusion process with the ResShift mechanism, finely balancing between sharpness and diffusion effects. Comprehensive experiments confirm our framework's advantage over state-of-the-art diffusion-based VSR techniques. The code is available at https://github.com/Tang1705/SeeClear-NeurIPS24.

What problem does this paper attempt to address?

This paper addresses two main problems in video super-resolution (VSR):

1. **Detail Consistency Preservation**: Diffusion-based VSR methods generate perceptually realistic videos but struggle to keep details consistent across frames, because the sampling process introduces stochastic fluctuations.
2. **Ineffectiveness of Pixel-level Alignment**: Traditional pixel-level alignment fails on diffusion-processed frames, because the iterative diffusion process perturbs them and makes the alignment inaccurate.

To solve these problems, the paper proposes SeeClear, a novel VSR framework based on conditional video generation that guides pixel condensation through instance-centric and channel-wise semantic control. Specifically, SeeClear introduces the following innovations:

- **Semantic Distiller and Pixel Condenser**: These two modules work together to extract and upscale semantic details from low-resolution frames.
- **Instance-Centric Alignment Module (InCAM)**: Uses video-clip-wise tokens to dynamically relate pixels within and across frames, improving coherence.
- **Channel-wise Texture Aggregation Memory (CaTeGory)**: Introduces extrinsic knowledge and draws on long-standing semantic textures to enhance global temporal consistency.
- **ResShift Mechanism in the Blurring Diffusion Process**: Introduces a residual shift into the diffusion process to balance sharpness and diffusion effects.

Through these innovations, SeeClear improves video super-resolution quality while maintaining detail consistency between frames. Experimental results show that SeeClear outperforms existing diffusion-based VSR techniques on multiple metrics.

### Formula Summary

1. **Forward Diffusion Process**:
   \[ q(u_t|u_0)=\mathcal{N}(u_t|D_tu_0,\eta_tE),\quad t\in\{1,\dots,T\} \]
   where \(u_0 = V^TI_{HR}^i\), \(V^T\) is the DCT projection matrix, \(D_t = e^{\Lambda_t}\) is a diagonal blurring matrix, \(\eta_t\) is the noise variance, and \(E\) is the identity matrix.
2. **Forward Diffusion Process with Residual Shift**:
   \[ q(u_t|u_0,u_l)=\mathcal{N}(u_t|D_tu_0+\eta_te_t,\kappa^2\eta_tE),\quad t\in\{1,\dots,T\} \]
   where \(e_t = u_l - D_tu_0\) is the residual, \(u_l\) is the frequency-domain representation of the low-resolution frame, \(\eta_t\) is the shift sequence (it also scales the noise variance), and \(\kappa\) is a hyperparameter controlling the noise intensity.
3. **Reverse Sampling Process**:
   \[ p(u_0|u_l)=\int p(u_T|u_l)\prod_{t = 1}^T p_\theta(u_{t-1}|u_t,u_l)du_{1:T} \]
   where \(p(u_T|u_l)\approx\mathcal{N}(u_T|u_l,\kappa^2E)\), and \(p_\theta(u_{t-1}|u_t,u_l)\) is the reverse transition kernel that recovers \(u_{t-1}\) from \(u_t\).
4. **Discrete Wavelet Transform (DWT)**: \[ I_{HR}^{ll},I_{HR}^{lh},I_{HR}^{
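
The forward process with residual shift summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dct_matrix`, `blur_diagonal`, `forward_sample`) are hypothetical, the low-resolution frame is assumed to be already upsampled to the high-resolution size, and \(\Lambda_t\) is assumed to be a blur strength `tau_t` times the negative squared DCT frequencies, a common choice in blurring diffusion that the summary does not confirm.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n; plays the role of V^T."""
    k = np.arange(n)[:, None]          # frequency index (rows)
    i = np.arange(n)[None, :]          # pixel index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)            # rescale the DC row for orthonormality
    return m

def blur_diagonal(h, w, tau):
    """Diagonal of D_t = exp(Lambda_t).

    Assumption: Lambda_t = -tau * (squared DCT frequencies).
    """
    fy = np.pi * np.arange(h)[:, None] / h
    fx = np.pi * np.arange(w)[None, :] / w
    return np.exp(-tau * (fy ** 2 + fx ** 2))

def forward_sample(hr, lr_up, tau_t, eta_t, kappa, rng):
    """Draw u_t ~ N(D_t u_0 + eta_t e_t, kappa^2 eta_t E), e_t = u_l - D_t u_0.

    hr    : high-resolution frame, shape (H, W)
    lr_up : low-resolution frame upsampled to (H, W) (assumption)
    Returns the sample u_t and the frequency-domain LR frame u_l.
    """
    h, w = hr.shape
    vh, vw = dct_matrix(h), dct_matrix(w)
    u0 = vh @ hr @ vw.T                # u_0 = V^T I_HR (separable 2-D DCT)
    ul = vh @ lr_up @ vw.T             # u_l in the same frequency domain
    d_t = blur_diagonal(h, w, tau_t)   # D_t is diagonal, so apply elementwise
    mean = d_t * u0 + eta_t * (ul - d_t * u0)
    noise = rng.standard_normal((h, w))
    return mean + kappa * np.sqrt(eta_t) * noise, ul
```

With `eta_t = 1` the mean collapses to \(u_l\), which matches the prior \(p(u_T|u_l)\approx\mathcal{N}(u_T|u_l,\kappa^2E)\) used to start reverse sampling. With `eta_t = 0` and `tau_t = 0` the sample is exactly \(u_0\).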

SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution

Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Semantic Lens: Instance-Centric Semantic Alignment for Video Super-Resolution

Towards Interpretable Video Super-Resolution Via Alternating Optimization

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

DIVD: Deblurring with Improved Video Diffusion Model

AddSR: Accelerating Diffusion-based Blind Super-Resolution with Adversarial Diffusion Distillation

Self-Learned Video Super-Resolution with Augmented Spatial and Temporal Context

ConsisSR: Delving Deep into Consistency in Diffusion-based Image Super-Resolution

AsConvSR: Fast and Lightweight Super-Resolution Network with Assembled Convolutions

ClearSR: Latent Low-Resolution Image Embeddings Help Diffusion-Based Real-World Super Resolution Models See Clearer

Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution

A Conditional Diffusion Model With Fast Sampling Strategy for Remote Sensing Image Super-Resolution

Decoder-side Cross Resolution Synthesis for Video Compression Enhancement

DeeDSR: Towards Real-World Image Super-Resolution via Degradation-Aware Stable Diffusion

TempDiff: Enhancing Temporal‐awareness in Latent Diffusion for Real‐World Video Super‐Resolution

Video Super-Resolution Via a Spatio-Temporal Alignment Network

Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution