Abstract: Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including a causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.

What problem does this paper attempt to address?

This paper addresses a central challenge in generic video restoration (VR): recovering temporally consistent details from unknown degradations while maintaining high fidelity. Diffusion-based restoration methods have made progress in recent years, but they remain limited in generation capability and sampling efficiency. Specifically:

1. **Processing videos of arbitrary resolution and length**: Existing methods incur high computational cost and degraded performance on long or high-resolution videos, particularly when the input resolution differs from the resolution used in training.
2. **Improving sampling efficiency**: Many existing methods rely on block-wise sampling strategies, which significantly slow down inference, especially on long, high-resolution videos.
3. **Improving generation quality**: Existing methods often struggle to generate realistic textures and details under complex, unknown degradations, and perform poorly on real-world data.

To address these problems, the paper proposes SeedVR, a diffusion transformer designed to handle real-world video restoration efficiently at any length and resolution. SeedVR tackles the limitations of existing methods by introducing shifted window attention and a causal video autoencoder, and it achieves highly competitive performance on synthetic and real-world benchmarks as well as on AI-generated videos.

### Key contributions

1. **Shifted window attention**: By computing attention within large non-overlapping windows, SeedVR achieves competitive restoration quality at lower computational cost. Variable-sized windows near the spatial and temporal boundaries let it handle arbitrary input resolutions.
2. **Causal video autoencoder**: This design significantly improves training and inference efficiency while maintaining high-quality video reconstruction.
3. **Large-scale joint training**: Through mixed image and video training with a progressive, multi-scale schedule, SeedVR reaches state-of-the-art performance on various benchmarks, surpassing existing methods.

### Summary

The main goal of SeedVR is to overcome the high computational cost, low sampling efficiency, and poor generation quality of existing video restoration methods on videos of arbitrary length and resolution, providing a more efficient and higher-quality restoration solution for practical applications.
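
The idea of variable-sized windows near the boundary can be illustrated with a small partitioning sketch. This is not SeedVR's implementation: the function names (`window_bounds`, `partition_3d`, `attention_pairs`), the toy latent shape, and the window and shift sizes are all assumptions chosen for illustration. It only shows how a (time, height, width) grid can be split into non-overlapping windows without padding or resizing, and how the number of query-key pairs compares with full attention.

```python
from itertools import product
from math import prod
from typing import List, Sequence, Tuple

Bounds = Tuple[int, int]  # half-open interval [start, stop) along one axis


def window_bounds(length: int, window: int, shift: int = 0) -> List[Bounds]:
    """Split [0, length) into consecutive windows of at most `window` elements.

    A non-zero shift offsets the window grid, so the first window is shorter.
    The last window is shorter whenever the remaining extent is not a multiple
    of `window`. The input itself is never padded or resized.
    """
    if length <= 0 or window <= 0:
        raise ValueError("length and window must be positive")
    shift %= window
    first_edge = shift if shift > 0 else window
    # Interior edges of the grid, bracketed by the two ends of the axis.
    edges = [0] + list(range(first_edge, length, window)) + [length]
    return list(zip(edges[:-1], edges[1:]))


def partition_3d(
    shape: Sequence[int],
    window: Sequence[int],
    shift: Sequence[int] = (0, 0, 0),
) -> List[Tuple[Bounds, ...]]:
    """Return every (time, height, width) window as a tuple of per-axis bounds."""
    per_axis = [window_bounds(n, k, s) for n, k, s in zip(shape, window, shift)]
    return list(product(*per_axis))


def attention_pairs(windows: Sequence[Tuple[Bounds, ...]]) -> int:
    """Count query-key pairs when attention is restricted to each window."""
    return sum(prod(hi - lo for lo, hi in win) ** 2 for win in windows)


# Toy latent grid whose extents are NOT multiples of the window size.
shape = (5, 10, 10)   # (frames, height, width), hypothetical values
window = (4, 4, 4)    # hypothetical window size

regular = partition_3d(shape, window)
shifted = partition_3d(shape, window, shift=(2, 2, 2))
full_pairs = prod(shape) ** 2  # cost proxy for full self-attention

print("windows (regular, shifted):", len(regular), len(shifted))
print("pairs (windowed vs. full):", attention_pairs(regular), full_pairs)
```

Alternating between the regular and shifted grids in successive layers is what lets information cross window borders. Both grids cover every token exactly once, so any input shape is accepted, while the pair count stays far below that of full attention.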

SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration

Spatio-Temporal Deformable Convolution for Compressed Video Quality Enhancement

Video super-resolution with phase-aided deformable alignment network

DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models

VRT: A Video Restoration Transformer

Disentangle Propagation and Restoration for Efficient Video Recovery

Deep Video Restoration for Under-Display Camera

Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency

VEnhancer: Generative Space-Time Enhancement for Video Generation

EDVR: Video Restoration With Enhanced Deformable Convolutional Networks

Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Learning Degradation-Robust Spatiotemporal Frequency-Transformer for Video Super-Resolution

Unsupervised Flow-Aligned Sequence-to-Sequence Learning for Video Restoration

Zero-shot Video Restoration and Enhancement Using Pre-Trained Image Diffusion Model

PixRevive: Latent Feature Diffusion Model for Compressed Video Quality Enhancement

SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution

FLAIR: A Conditional Diffusion Framework with Applications to Face Video Restoration

DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration