Abstract: In recent years there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To our best knowledge, Cavia is the first of its kind that allows the user to precisely specify camera motion while obtaining object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality. Project Page: https://ir1d.github.io/Cavia/
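
The idea of extending per-view attention into view-integrated attention can be pictured with a small NumPy sketch. This is not Cavia's implementation: the tensor layout `(V, T, S, C)` (views, frames, spatial tokens, channels), the function names, and the single-head attention without learned projections are all assumptions chosen for illustration. The sketch only shows the reshaping trick: folding the view axis into the token axis lets each token attend to tokens from every view of the same time step.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention with identity projections (illustrative only).

    tokens: array of shape (..., N, C). The N tokens attend to each other
    independently for every leading index.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def per_view_spatial_attention(x):
    # x: (V, T, S, C). Each frame of each view attends only within itself.
    return self_attention(x)

def view_integrated_spatial_attention(x):
    # Hypothetical variant: for each time step, tokens from all V views are
    # concatenated into one sequence of length V * S before attention.
    V, T, S, C = x.shape
    merged = x.transpose(1, 0, 2, 3).reshape(T, V * S, C)
    out = self_attention(merged)
    # Restore the original (V, T, S, C) layout.
    return out.reshape(T, V, S, C).transpose(1, 0, 2, 3)

# Usage: two views, three frames, four spatial tokens, eight channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 8))
y = view_integrated_spatial_attention(x)
print(y.shape)
```

The same folding could be applied to the frame axis for temporal attention. With a single view, the view-integrated variant reduces to ordinary per-view attention.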

### What problems does this paper attempt to solve?

This paper targets two open problems in image-to-video generation: **3D consistency** and **camera controllability**.

1. **3D consistency**: Existing image-to-video models generate high-quality videos but still struggle to keep the generated frames geometrically consistent. The problem is most visible in multi-view generation, where frames rendered from different views often disagree with each other.
2. **Camera controllability**: Current methods have difficulty following a precisely specified camera path. Users want to specify the camera trajectory and have the generated video follow it while object motion remains realistic.

To address these challenges, the paper proposes a framework named **Cavia** with the following features:

- **Multi-view video generation**: Cavia generates multiple spatiotemporally consistent videos from a single input image, and these videos can follow different camera paths.
- **View-integrated attention mechanism**: A 3D attention module across views and frames improves both viewpoint and temporal consistency.
- **Joint training strategy**: A mixture of static, monocular dynamic, and multi-view dynamic video data is used for joint training, targeting geometric consistency, high-quality object motion, and background fidelity.

### Main contributions of the paper

1. **A new framework**: Cavia generates multi-view videos with camera controllability. It introduces a view-integrated attention mechanism, including 3D attention across views and frames, to improve consistency between views and between frames.
2. **An effective joint training strategy**: Combining static, monocular dynamic, and multi-view dynamic video data sources supports geometric consistency, high-quality object motion, and background fidelity in the generated results.
3. **Experimental verification**: Extensive experiments show that Cavia performs better than existing methods in monocular video generation and cross-video consistency, both qualitatively and quantitatively. In addition, the framework can generate four views at inference time and supports 3D reconstruction of the generated frames.

### Formula examples

Some of the formulas involved in the paper are as follows:

- Probability flow ordinary differential equation (ODE) of the diffusion model:

  \[ dx = -\dot{\sigma}(t)\,\sigma(t)\,\nabla_x \log p(x; \sigma(t))\,dt \]

- Parameterization of the denoiser \(D_\theta\):

  \[ D_\theta(x; \sigma) = c_{\text{skip}}(\sigma)\,x + c_{\text{out}}(\sigma)\,F_\theta\big(c_{\text{in}}(\sigma)\,x;\; c_{\text{noise}}(\sigma)\big) \]

These formulas show the core mechanism of the diffusion model and how the denoiser \(D_\theta\) is parameterized through the neural network \(F_\theta\).

### Summary

By introducing the view-integrated attention mechanism and the joint training strategy, this paper addresses the 3D consistency and camera controllability problems in video generation and offers a new approach to multi-view video generation.
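
To make the two formulas concrete, here is a minimal NumPy sketch of the denoiser parameterization and one way to integrate the probability flow ODE. Several things are assumptions not stated in the summary above: the coefficient formulas and `SIGMA_DATA = 0.5` follow the common EDM preconditioning of Karras et al., the noise schedule is taken as \(\sigma(t) = t\), the score is replaced by the standard identity \(\nabla_x \log p(x; \sigma) \approx (D_\theta(x; \sigma) - x) / \sigma^2\), and all function names are hypothetical. The network `F` is a stand-in callable, not Cavia's model.

```python
import numpy as np

SIGMA_DATA = 0.5  # assumed data standard deviation (EDM default, not from the paper summary)

def edm_coefficients(sigma, sigma_data=SIGMA_DATA):
    # Preconditioning coefficients as chosen in EDM (assumption for this sketch).
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / np.sqrt(sigma**2 + sigma_data**2)
    c_in = 1.0 / np.sqrt(sigma**2 + sigma_data**2)
    c_noise = 0.25 * np.log(sigma)
    return c_skip, c_out, c_in, c_noise

def denoiser(F, x, sigma):
    # D(x; sigma) = c_skip * x + c_out * F(c_in * x; c_noise)
    c_skip, c_out, c_in, c_noise = edm_coefficients(sigma)
    return c_skip * x + c_out * F(c_in * x, c_noise)

def euler_step(F, x, sigma, sigma_next):
    # With sigma(t) = t and score = (D - x) / sigma^2, the ODE becomes
    # dx/dsigma = (x - D(x; sigma)) / sigma.
    slope = (x - denoiser(F, x, sigma)) / sigma
    return x + (sigma_next - sigma) * slope

def integrate(F, x, sigmas):
    # Plain Euler integration along a decreasing sequence of noise levels.
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = euler_step(F, x, s, s_next)
    return x

# Usage with a toy network that always outputs zero. For zero-mean Gaussian
# data with standard deviation SIGMA_DATA this happens to be the ideal network.
zero_net = lambda x_in, c_noise: np.zeros_like(x_in)
x_start = np.array([1.0, -2.0])
x_end = integrate(zero_net, x_start, np.linspace(10.0, 0.1, 2001))
print(x_end)
```

Euler integration is used only because it is the shortest correct solver to write down; practical samplers typically use higher-order steps and a non-uniform noise schedule.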

Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control

CamI2V: Camera-Controlled Image-to-Video Diffusion Model

Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

A Network-Friendly Architecture for Multi-View Video Coding (Mvc)

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

Vivid-ZOO: Multi-View Video Generation with Diffusion Model

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

Training-free Camera Control for Video Generation

VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

COMD: Training-free Video Motion Transfer with Camera-Object Motion Disentanglement

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation

DiVE: DiT-based Video Generation with Enhanced Control

VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model

CaV3: Cache-assisted Viewport Adaptive Volumetric Video Streaming

CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers

Animate3D: Animating Any 3D Model with Multi-view Video Diffusion