Abstract: Generating high-quality stereo videos that mimic human binocular vision requires maintaining consistent depth perception and temporal coherence across frames. While diffusion models have advanced image and video synthesis, generating high-quality stereo videos remains challenging due to the difficulty of maintaining consistent temporal and spatial coherence between left and right views. We introduce StereoCrafter-Zero, a novel framework for zero-shot stereo video generation that leverages video diffusion priors without the need for paired training data. Key innovations include a noisy restart strategy to initialize stereo-aware latents and an iterative refinement process that progressively harmonizes the latent space, addressing issues like temporal flickering and view inconsistencies. Comprehensive evaluations, including quantitative metrics and user studies, demonstrate that StereoCrafter-Zero produces high-quality stereo videos with improved depth consistency and temporal smoothness, even when depth estimations are imperfect. Our framework is robust and adaptable across various diffusion models, setting a new benchmark for zero-shot stereo video generation and enabling more immersive visual experiences. Our code can be found at https://github.com/shijianjian/StereoCrafter-Zero.

What problem does this paper attempt to address?

The paper addresses the problem of generating high-quality stereoscopic videos without paired training data. Specifically, it focuses on generating stereoscopic videos that mimic human binocular vision while maintaining temporal coherence and spatial consistency. This involves the following key challenges:

1. **Temporal flicker and view inconsistency**: Temporal flicker and inconsistencies between the left and right views are common in generated stereoscopic videos and seriously degrade the viewing experience.
2. **Consistent depth perception**: The depth perceived from the left and right views must remain consistent to achieve a realistic parallax effect.
3. **Robustness in dynamic scenes**: Existing methods often struggle to adapt to scene changes in dynamic content, which lowers the quality of the generated videos.

To address these challenges, the paper proposes **StereoCrafter-Zero**, a new zero-shot stereoscopic video generation framework that uses video diffusion priors to generate high-quality stereoscopic videos. Its key innovations are:

- **Noisy restart strategy**: Noise is introduced to initialize stereo-aware latents, which enhances the robustness and diversity of the generated videos.
- **Iterative refinement process**: Controlled noise is injected gradually to progressively harmonize the latent space, which addresses temporal flicker and view inconsistency.

Through quantitative evaluation and user studies, the paper shows that **StereoCrafter-Zero** generates high-quality stereoscopic videos with better depth consistency and temporal smoothness, and performs well even when the depth estimation is imperfect. The framework is also robust and adaptable across various diffusion models, setting a new benchmark for zero-shot stereoscopic video generation.
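
The summary above does not spell out how the noisy restart and iterative refinement are implemented, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes the restart resembles the standard DDPM forward process, where a latent is re-noised to an intermediate timestep and then denoised again, and that refinement repeats this at decreasing noise levels. The names `make_alpha_bar`, `noisy_restart`, `iterative_refine`, and `toy_denoiser`, along with the linear beta schedule and the chosen timesteps, are all hypothetical.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention factors for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def noisy_restart(latent, t, alpha_bar, rng):
    """Re-noise a latent to timestep t using the standard forward diffusion step:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(latent.shape)
    return np.sqrt(alpha_bar[t]) * latent + np.sqrt(1.0 - alpha_bar[t]) * eps


def iterative_refine(latent, denoise_fn, restart_steps, alpha_bar, rng):
    """Alternate re-noising and denoising over a list of timesteps.

    `denoise_fn(noisy_latent, t)` stands in for a video diffusion model that
    maps a noisy latent back to a clean estimate; it is a placeholder here.
    """
    for t in restart_steps:
        noisy = noisy_restart(latent, t, alpha_bar, rng)
        latent = denoise_fn(noisy, t)
    return latent


def toy_denoiser(noisy, t):
    """Placeholder denoiser: averages over the frame axis to mimic temporal smoothing.
    A real system would call a pretrained video diffusion model instead."""
    return np.broadcast_to(noisy.mean(axis=0, keepdims=True), noisy.shape).copy()


# Hypothetical latent for the right view: (frames, channels, height, width).
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
right_view_latent = rng.standard_normal((8, 4, 16, 16))

# Restart from progressively lower noise levels (timesteps chosen arbitrarily).
refined = iterative_refine(
    right_view_latent, toy_denoiser, [600, 300, 100], alpha_bar, rng
)
```

In this sketch, larger timesteps discard more of the original latent, so starting high and stepping down lets early passes correct coarse inconsistencies while later passes preserve more detail. Whether the paper follows this exact schedule is not stated in the summary.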

StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart

StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos

ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning

Stereoscopic Video Synthesis from A Monocular Video

Fine-gained Zero-shot Video Sampling

Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail

StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion Models

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

ActiveZero++: Mixed Domain Learning Stereo and Confidence-based Depth Completion with Zero Annotation

High-Quality Depth Recovery Via Interactive Multi-view Stereo

DiffuStereo: High Quality Human Reconstruction via Diffusion-based Stereo Using Sparse Cameras

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

A Temporally Streamlined Optimization Method for Stereo Video Correspondence

Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

Real-Time 3D Video Synthesis from Binocular Stereo Camera

A Unified Scheme for Super-Resolution and Depth Estimation from Asymmetric Stereoscopic Video

Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering

A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing