Abstract: In this paper, we propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Building on recent advancements in spatio-temporal optimization for video inverse problems using image diffusion models, our approach leverages latent-space diffusion models to achieve enhanced video quality and resolution. To address the high computational demands of processing high-resolution frames, we introduce a pseudo-batch consistent sampling strategy, allowing efficient operation on a single GPU. Additionally, to improve temporal consistency, we present batch-consistent inversion, an initialization technique that incorporates informative latents from the measurement frame. By integrating with SDXL, our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations, such as deblurring, super-resolution, and inpainting. Unlike previous methods, our approach supports multiple aspect ratios (landscape, vertical, and square) and delivers HD-resolution reconstructions (exceeding 1280x720) in under 2.5 minutes on a single NVIDIA 4090 GPU. Project page: https://vision-xl.github.io/
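
To make the phrase "combinations of frame averaging and various spatial degradations" concrete, the following is a minimal sketch of one such spatio-temporal forward operator: temporal frame averaging followed by spatial downsampling. It is not taken from the paper. The function names, the circular handling of the last frames, and the use of average pooling as the super-resolution degradation are all assumptions made for illustration.

```python
import numpy as np

def frame_average(video: np.ndarray, window: int) -> np.ndarray:
    """Average each frame with the next (window - 1) frames.

    video has shape (T, H, W). Wrapping around at the end of the clip
    (circular boundary) is an assumption made to keep the sketch short.
    """
    out = np.zeros_like(video, dtype=np.float64)
    for shift in range(window):
        # np.roll with a negative shift brings frame t + shift to position t
        out += np.roll(video, -shift, axis=0)
    return out / window

def downsample(video: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool every frame by an integer factor (hypothetical SR degradation)."""
    T, H, W = video.shape
    assert H % factor == 0 and W % factor == 0, "H and W must be divisible by factor"
    blocks = video.reshape(T, H // factor, factor, W // factor, factor)
    return blocks.mean(axis=(2, 4))

def degrade(video: np.ndarray, window: int = 3, factor: int = 4) -> np.ndarray:
    """Spatio-temporal measurement: temporal averaging, then spatial downsampling."""
    return downsample(frame_average(video, window), factor)
```

A reconstruction method of the kind described in the abstract would receive the output of an operator like `degrade` as its measurement and try to recover the original frames.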

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve complex spatio-temporal inverse problems in high-resolution videos. Specifically, the authors propose a new framework named VISION-XL, which uses latent image diffusion models to achieve high-quality, high-resolution video reconstruction. The paper mainly addresses the following key problems:

1. **Computational requirements for high-resolution video processing**:
   - Processing high-resolution video frames requires a large amount of computational resources. VISION-XL introduces a pseudo-batch consistent sampling strategy, enabling it to run efficiently on a single GPU.
2. **Temporal consistency**:
   - Maintaining temporal consistency in video reconstruction is a challenge. VISION-XL improves temporal consistency with batch-consistent inversion, which initializes the latent variables with information from the measured frames.
3. **Handling of multiple spatial degradation combinations**:
   - The paper shows that VISION-XL can handle complex spatio-temporal degradation combinations, such as frame averaging combined with super-resolution, deblurring, and inpainting tasks.
4. **Support for multiple aspect ratios and high resolutions**:
   - VISION-XL supports multiple aspect ratios (landscape, vertical, and square), and can reconstruct a 25-frame video with a resolution of 1280×768 on a single NVIDIA 4090 GPU in less than 2.5 minutes.

### Specific methods and techniques

The main contributions and techniques of VISION-XL include:

- **Pseudo-Batch Consistent Sampling Strategy**: It is used to manage high memory requirements, enabling the method to run on a single GPU.
- **Batch-Consistent Inversion**: The latent representation of measured frames is used during initialization to enhance temporal consistency.
- **Low-Pass Filtered Encoding**: A low-pass filter is applied in the early stage to obtain more natural and refined results.
- **Multi-step Conjugate Gradient Optimization**: It ensures data consistency and improves reconstruction quality.

### Experimental results

The experimental results show that VISION-XL significantly outperforms existing methods on multiple spatio-temporal inverse problems, especially in terms of metrics such as FVD (Fréchet Video Distance), PSNR (Peak Signal-to-Noise Ratio), and SSIM (Structural Similarity Index). In addition, VISION-XL demonstrates strong performance when reconstructing videos with different aspect ratios and high resolutions.

In general, VISION-XL provides an efficient and high-quality solution for complex spatio-temporal inverse problems in high-resolution videos.
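
As a rough illustration of the multi-step conjugate gradient idea listed among the techniques above, the sketch below runs a few conjugate gradient steps on the normal equations AᵀA x = Aᵀy, starting from a current estimate `x0`. It is not the authors' implementation. The measurement operator `A` (2D average pooling), its adjoint `At`, the function name `cg_data_consistency`, and the stopping threshold are all assumptions chosen so the example stays self-contained.

```python
import numpy as np

def A(x: np.ndarray, f: int = 2) -> np.ndarray:
    """Hypothetical measurement operator: f x f average pooling of a 2D image."""
    H, W = x.shape
    return x.reshape(H // f, f, W // f, f).mean(axis=(1, 3))

def At(y: np.ndarray, f: int = 2) -> np.ndarray:
    """Adjoint of A: spread each value over its f x f block, scaled by 1 / f^2."""
    return np.kron(y, np.ones((f, f))) / (f * f)

def cg_data_consistency(x0: np.ndarray, y: np.ndarray,
                        n_steps: int = 5, f: int = 2) -> np.ndarray:
    """Run up to n_steps conjugate gradient steps on A^T A x = A^T y from x0."""
    x = x0.astype(np.float64).copy()
    r = At(y - A(x, f), f)          # residual of the normal equations
    p = r.copy()                    # initial search direction
    rs = np.vdot(r, r)              # squared residual norm
    for _ in range(n_steps):
        if rs < 1e-20:              # residual is numerically zero; stop early
            break
        Ap = At(A(p, f), f)
        alpha = rs / np.vdot(p, Ap) # step size along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.vdot(r, r)
        p = r + (rs_new / rs) * p   # next conjugate direction
        rs = rs_new
    return x
```

In a diffusion-based solver of this kind, `x0` would typically be the current denoised estimate, and the returned `x` stays close to it while agreeing with the measurement `y`. How VISION-XL applies such a step within its sampling loop is described in the paper itself and is not reproduced here.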

VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models

Solving Video Inverse Problems Using Image Diffusion Models

Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models

Efficiently Exploiting Spatially Variant Knowledge for Video Deblurring

Towards Interpretable Video Super-Resolution Via Alternating Optimization

Low-Light Video Enhancement via Spatial-Temporal Consistent Illumination and Reflection Decomposition

PixRevive: Latent Feature Diffusion Model for Compressed Video Quality Enhancement

An Efficient Algorithm for Video Super-Resolution Based On a Sequential Model

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

HDR Video Reconstruction with Tri-Exposure Quad-Bayer Sensors

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency

ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler

Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting

Deformable Kernel Convolutional Network for Video Extreme Super-Resolution

Diffusion-Promoted HDR Video Reconstruction

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting

Hybrid CNN-Transformer Architecture for Efficient Large-Scale Video Snapshot Compressive Imaging

Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model