Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

Dejia Xu, Yifan Jiang, Chen Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao, Zhangyang Wang, Hao Tang
2024-10-15
Abstract: In recent years there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To the best of our knowledge, Cavia is the first of its kind that allows the user to precisely specify camera motion while obtaining object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality. Project Page: https://ir1d.github.io/Cavia/
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper targets two open problems in image-to-video generation: **3D consistency** and **camera controllability**.

1. **3D consistency**: Existing image-to-video models produce high-quality videos, but they struggle to keep the generated frames geometrically consistent. The problem is most visible in multi-view generation, where frames rendered from different viewpoints of the same scene often disagree with one another.
2. **Camera controllability**: Current methods cannot precisely control the camera path during generation. Users want to specify the camera trajectory and have the generated video follow it while object motion stays realistic.

To address these challenges, the paper proposes a framework named **Cavia** with the following features:

- **Multi-view video generation**: Cavia generates multiple spatiotemporally consistent videos from a single input image, each following a different camera path.
- **View-integrated attention**: 3D attention modules that operate across views and frames improve both viewpoint consistency and temporal consistency.
- **Joint training strategy**: The model is trained jointly on a mixture of static, monocular dynamic, and multi-view dynamic video data, which promotes geometric consistency, high-quality object motion, and background fidelity in the generated results.

### Main contributions of the paper

1. **A new framework**: Cavia generates multi-view videos with camera control. It introduces view-integrated attention, that is, 3D attention across views and frames, to strengthen consistency between views and between frames.
2. **An effective joint training strategy**: Combining static, monocular dynamic, and multi-view dynamic video data promotes geometric consistency, high-quality object motion, and background fidelity in the generated results.
3. **Experimental verification**: Extensive experiments show that Cavia performs strongly in monocular video generation and cross-video consistency, and that it outperforms existing methods both qualitatively and quantitatively. The framework can also generate four views at inference time, and the generated frames support 3D reconstruction.

### Formula examples

Some of the formulas involved in the paper are as follows:

- Probability flow ordinary differential equation (ODE) of the diffusion model: \[ dx = -\dot{\sigma}(t)\,\sigma(t)\,\nabla_x \log p(x; \sigma(t))\,dt \]
- Parameterization of the denoiser \(D_\theta\): \[ D_\theta(x; \sigma) = c_{\text{skip}}(\sigma)\,x + c_{\text{out}}(\sigma)\,F_\theta\big(c_{\text{in}}(\sigma)\,x;\ c_{\text{noise}}(\sigma)\big) \]

These formulas show the core mechanism of the diffusion model and how the denoiser \(D_\theta\) is parameterized through the neural network \(F_\theta\): the coefficients \(c_{\text{skip}}\), \(c_{\text{out}}\), \(c_{\text{in}}\), and \(c_{\text{noise}}\) depend on the noise level \(\sigma\) and rescale the input, the network output, and the skip connection.

### Summary

By introducing view-integrated attention and a joint training strategy, this paper addresses the 3D consistency and camera controllability problems in video generation and offers a new approach to multi-view video generation.
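
To make the \(D_\theta\) parameterization from the formula examples more concrete, here is a minimal numerical sketch in plain Python. The summary does not say how the coefficients are chosen, so this sketch assumes the standard EDM choices (Karras et al., 2022) with \(\sigma_{\text{data}} = 0.5\). The names `edm_coefficients`, `denoise`, and `zero_network` are hypothetical and do not come from the Cavia code base, and `zero_network` is only a stand-in for the real network \(F_\theta\).

```python
import math

# Assumption: EDM's default data standard deviation; not stated in the summary.
SIGMA_DATA = 0.5


def edm_coefficients(sigma, sigma_data=SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) for one noise level sigma.

    These are the EDM preconditioning choices, assumed here for illustration.
    """
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    c_noise = 0.25 * math.log(sigma)
    return c_skip, c_out, c_in, c_noise


def denoise(x, sigma, network, sigma_data=SIGMA_DATA):
    """Compute D_theta(x; sigma) = c_skip * x + c_out * F_theta(c_in * x; c_noise)."""
    c_skip, c_out, c_in, c_noise = edm_coefficients(sigma, sigma_data)
    scaled = [c_in * value for value in x]   # rescale the network input
    raw = network(scaled, c_noise)           # F_theta(c_in * x; c_noise)
    return [c_skip * xi + c_out * fi for xi, fi in zip(x, raw)]


def zero_network(scaled_x, c_noise):
    """Hypothetical stand-in for F_theta that always predicts zeros."""
    return [0.0 for _ in scaled_x]


if __name__ == "__main__":
    sample = [1.0, -2.0, 0.5]
    for sigma in (0.01, 0.5, 80.0):
        print(sigma, denoise(sample, sigma, zero_network))
```

With these assumed coefficients, the skip weight \(c_{\text{skip}}\) is close to 1 at low noise, so the output stays near the input, and shrinks toward 0 at high noise, where the output is dominated by the network prediction.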