SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration

Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, Lu Jiang
2025-01-03
Abstract: Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the central challenge of generic video restoration (VR): recovering temporally consistent details from unknown degradations while maintaining high fidelity. Diffusion-based restoration methods have made progress in recent years, but they remain limited in generation capability and sampling efficiency. Specifically:

1. **Processing videos of arbitrary resolution and length**: Existing methods suffer from high computational cost and degraded performance on long or high-resolution videos. These problems are more pronounced when the input resolution differs from the resolution used in training.
2. **Improving sampling efficiency**: Many existing methods rely on block-based sampling strategies, which significantly slow down inference, especially on long, high-resolution videos.
3. **Improving generation quality**: Existing methods often struggle to generate realistic textures and details under complex, unknown degradations, and perform particularly poorly on real-world data.

To address these problems, the paper proposes SeedVR, a model based on a diffusion transformer that aims to handle real-world video restoration at any length and resolution efficiently. SeedVR tackles the limitations of existing methods by introducing a shifted window attention mechanism and a causal video autoencoder, and it achieves highly competitive performance on multiple benchmarks.

### Key contributions

1. **Shifted Window Attention Mechanism**: By using larger, non-overlapping attention windows, SeedVR achieves competitive video restoration quality at a lower computational cost, and it handles arbitrary input resolutions well.
2. **Causal Video Autoencoder**: This design significantly improves training and inference efficiency while maintaining high-quality video reconstruction.
3. **Large-scale joint training**: Through multi-scale progressive training on both images and videos, SeedVR reaches state-of-the-art performance on various benchmarks, surpassing existing methods.

### Summary

SeedVR targets the high computational cost, low sampling efficiency, and poor generation quality that existing video restoration methods show on videos of arbitrary length and resolution, with the goal of providing a more efficient and higher-quality restoration solution for practical applications.
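
To make the idea of variable-sized windows near the boundary more concrete, the following minimal Python sketch partitions a (time, height, width) token grid into non-overlapping windows, optionally shifted, and lets the windows at the edges shrink instead of requiring the grid size to be a multiple of the window size. It is an illustration only, not the authors' implementation: the function names `window_bounds` and `partition_3d`, the window sizes, and the shift values are all hypothetical.

```python
from itertools import product


def window_bounds(size, window, shift=0):
    """Split range(size) into consecutive, non-overlapping windows.

    Windows have length `window`, offset by `shift`. Windows that touch
    the boundary may be shorter (variable-sized), so `size` does not
    have to be a multiple of `window`. Hypothetical helper for illustration.
    """
    if size <= 0 or window <= 0:
        raise ValueError("size and window must be positive")
    shift %= window
    first_cut = shift if shift > 0 else window
    # Cut points inside the range, plus the two outer boundaries.
    edges = [0, *range(first_cut, size, window), size]
    return list(zip(edges, edges[1:]))


def partition_3d(shape, window, shift=(0, 0, 0)):
    """Return (t, h, w) window bounds covering a token grid of `shape`."""
    per_axis = [
        window_bounds(s, w, sh) for s, w, sh in zip(shape, window, shift)
    ]
    return list(product(*per_axis))


def volume(win):
    """Number of tokens inside one 3D window."""
    n = 1
    for start, stop in win:
        n *= stop - start
    return n


if __name__ == "__main__":
    print(window_bounds(10, 4))           # [(0, 4), (4, 8), (8, 10)]
    print(window_bounds(10, 4, shift=2))  # [(0, 2), (2, 6), (6, 10)]
    regular = partition_3d((5, 10, 10), (4, 4, 4))
    shifted = partition_3d((5, 10, 10), (4, 4, 4), shift=(2, 2, 2))
    # Every token belongs to exactly one window in both layouts.
    print(sum(map(volume, regular)), sum(map(volume, shifted)))
```

In a window-attention layer, attention would be computed independently inside each returned window, and alternating between the regular and the shifted layout in successive layers lets information cross window borders. Because the edge windows are simply smaller, the same partitioning works for any input length and resolution.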