StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart

Jian Shi, Qian Wang, Zhenyu Li, Peter Wonka
2024-11-22
Abstract: Generating high-quality stereo videos that mimic human binocular vision requires maintaining consistent depth perception and temporal coherence across frames. While diffusion models have advanced image and video synthesis, generating high-quality stereo videos remains challenging due to the difficulty of maintaining consistent temporal and spatial coherence between left and right views. We introduce StereoCrafter-Zero, a novel framework for zero-shot stereo video generation that leverages video diffusion priors without the need for paired training data. Key innovations include a noisy restart strategy to initialize stereo-aware latents and an iterative refinement process that progressively harmonizes the latent space, addressing issues like temporal flickering and view inconsistencies. Comprehensive evaluations, including quantitative metrics and user studies, demonstrate that StereoCrafter-Zero produces high-quality stereo videos with improved depth consistency and temporal smoothness, even when depth estimations are imperfect. Our framework is robust and adaptable across various diffusion models, setting a new benchmark for zero-shot stereo video generation and enabling more immersive visual experiences. Our code can be found at https://github.com/shijianjian/StereoCrafter-Zero.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of generating high-quality stereo videos without paired training data. Specifically, it focuses on generating stereo videos that mimic human binocular vision while maintaining temporal coherence and spatial consistency. This involves the following key challenges:

1. **Temporal flickering and view inconsistencies**: Flickering over time and inconsistencies between the left and right views are common in generated stereo videos and seriously degrade the viewing experience.
2. **Consistent depth perception**: Depth perception must remain consistent between the left and right views to achieve a realistic parallax effect.
3. **Robustness in dynamic scenes**: Existing methods often struggle to adapt to scene changes in dynamic content, which lowers the quality of the generated videos.

To address these challenges, the paper proposes **StereoCrafter-Zero**, a new zero-shot stereo video generation framework that leverages video diffusion priors to generate high-quality stereo videos. Its key innovations are:

- **Noisy restart strategy**: Noise is introduced to initialize stereo-aware latents, which improves the robustness and diversity of the generated videos.
- **Iterative refinement process**: Controlled noise is injected gradually to progressively harmonize the latent space, addressing temporal flickering and view inconsistencies.
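The noisy restart and iterative refinement ideas described above can be illustrated with a small re-noise-then-denoise loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: the re-noising step uses the standard DDPM forward-noising formula, the increasing `alpha_bars` schedule is an assumed choice, and `toy_denoise` is a hypothetical stand-in for a real video diffusion denoiser. All function and parameter names are illustrative.

```python
import numpy as np


def noisy_restart(latent, alpha_bar_t, rng):
    """Re-noise a latent to an intermediate diffusion step.

    Uses the standard DDPM forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    """
    eps = rng.standard_normal(latent.shape)
    return np.sqrt(alpha_bar_t) * latent + np.sqrt(1.0 - alpha_bar_t) * eps


def toy_denoise(noisy, reference, strength=0.5):
    """Hypothetical placeholder for a diffusion denoiser.

    It simply moves the noisy latent part of the way toward a reference
    latent; a real system would call a video diffusion model here.
    """
    return noisy + strength * (reference - noisy)


def iterative_refine(init_latent, reference, alpha_bars, rng):
    """Alternate re-noising and denoising over a schedule of noise levels.

    alpha_bars is assumed to increase, so less noise is injected
    in each successive round.
    """
    latent = init_latent
    for alpha_bar_t in alpha_bars:
        latent = noisy_restart(latent, alpha_bar_t, rng)
        latent = toy_denoise(latent, reference)
    return latent


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative latent shape: (frames, channels, height, width).
    init = rng.standard_normal((4, 4, 8, 8))
    ref = np.zeros_like(init)
    refined = iterative_refine(init, ref, [0.5, 0.8, 0.95], rng)
    print(refined.shape)
```

The point of the sketch is the control flow: each round perturbs the current latent with a controlled amount of noise and then lets the denoiser pull it back, so errors in the initial latent can be corrected gradually rather than in a single pass.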
Through quantitative evaluation and user studies, the paper shows that **StereoCrafter-Zero** generates high-quality stereo videos with improved depth consistency and temporal smoothness, even when depth estimation is imperfect. The framework is also robust and adaptable across various diffusion models, setting a new benchmark for zero-shot stereo video generation.