Abstract: Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that are not generated by a video model. In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. Our method allows us to re-generate the reference video, with all its existing scene motion, from vastly different angles and with cinematic camera motion. Notably, using our method we can also plausibly hallucinate parts of the scene that were not observable in the reference video. Our method works by (1) generating a noisy anchor video with a new camera trajectory using multiview diffusion models or depth-based point cloud rendering and then (2) regenerating the anchor video into a clean and temporally consistent reangled video using our proposed masked video fine-tuning technique.

What problem does this paper attempt to address?

### Problem the Paper Attempts to Solve

This paper addresses the problem of generating new videos with novel camera trajectories from user-provided videos. Existing methods can generate videos with controllable camera trajectories, but they cannot be directly applied to user-provided videos that were not generated by a video model. The paper therefore proposes **ReCapture**, a method that generates new videos with novel camera trajectories from a single user-provided video. It re-generates the reference video with all of its existing scene motion, viewed from very different angles and with cinematic camera movement. ReCapture can also plausibly hallucinate parts of the scene that were not observable in the reference video.

### Main Contributions

1. **Generating New Camera Trajectories**: ReCapture generates new videos with novel camera trajectories from user-provided videos while preserving the complex scene motion of the original video.
2. **Plausible Inference of Unobserved Scene Parts**: By generating a noisy anchor video and then applying masked video fine-tuning, ReCapture can plausibly fill in parts of the scene that the reference video never shows.
3. **Two-Stage Method**:
   - **Stage 1**: Generate a noisy anchor video with a novel camera trajectory, using either multi-view diffusion models or depth-based point cloud rendering.
   - **Stage 2**: Use masked video fine-tuning to regenerate the anchor video into a clean, temporally consistent re-angled video. This stage trains a temporal motion LoRA and a spatial LoRA to correct errors and inconsistencies in the noisy anchor video and to fill in missing information.

### Technical Details

- **Point Cloud Sequence Rendering**: Convert each video frame into a 3D point cloud, then reproject the point cloud along the new camera trajectory to render the new views.
- **Multi-View Image Diffusion**: Use a multi-view diffusion model to generate a new view for each frame. This option suits camera trajectories that involve large rotations and viewpoint changes.
- **Masked Video Fine-Tuning**: Produce the high-quality output video by training a context-aware spatial LoRA and a temporal motion LoRA on the known pixels. The masked loss excludes invalid regions of the noisy anchor video, so the model learns only from meaningful pixels.

### Experimental Results

- **Quantitative Evaluation**: On the Kubric-4D dataset, ReCapture outperforms existing 4D reconstruction and generation methods on multiple automatic metrics (e.g., PSNR, SSIM, LPIPS).
- **User Study**: User studies further support ReCapture's advantage, particularly in subject consistency, background consistency, and temporal smoothness.
- **Ablation Study**: Ablations show the importance of each component and support the effectiveness of the masked video fine-tuning technique.

In summary, ReCapture provides an effective way to add dynamic camera movement to user-provided videos and to generate high-quality outputs without relying on large-scale 4D multi-view video data.
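The point cloud rendering step described under Technical Details can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function name `reproject_frame`, the pinhole intrinsics `K`, and the convention that `R` and `t` map source-camera coordinates to target-camera coordinates are all assumptions made for illustration, and the per-frame depth map is assumed to come from some external depth estimator.

```python
import numpy as np


def reproject_frame(image, depth, K, R, t):
    """Forward-warp one frame into a new camera using its depth map.

    image: (H, W, C) array, depth: (H, W) positive depths,
    K: (3, 3) pinhole intrinsics (assumed shared by both cameras),
    R, t: hypothetical rotation (3, 3) and translation (3,) taking
          source-camera coordinates to target-camera coordinates.
    Returns (warped, mask); mask is False where no point landed.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject every pixel to a 3D point: X = depth * K^-1 [u, v, 1]^T.
    pts = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

    # Express the point cloud in the new camera's frame and project it.
    pts_new = pts @ R.T + t
    z = pts_new[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)
    proj = pts_new @ K.T
    u_new = np.round(proj[:, 0] / safe_z).astype(np.int64)
    v_new = np.round(proj[:, 1] / safe_z).astype(np.int64)
    valid = in_front & (u_new >= 0) & (u_new < W) & (v_new >= 0) & (v_new < H)

    src = np.flatnonzero(valid)
    dst = v_new[src] * W + u_new[src]

    # Z-buffer: keep only the nearest point that lands on each target pixel.
    zbuf = np.full(H * W, np.inf)
    np.minimum.at(zbuf, dst, z[src])
    nearest = z[src] == zbuf[dst]

    warped = np.zeros((H * W, image.shape[-1]), dtype=image.dtype)
    warped[dst[nearest]] = image.reshape(H * W, -1)[src[nearest]]
    mask = np.isfinite(zbuf).reshape(H, W)
    return warped.reshape(H, W, -1), mask
```

Applying this to every frame with a per-frame pose yields a rough re-angled video. The returned mask marks pixels that received no source point, which are the regions a later stage has to fill in.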
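The masked loss mentioned under Masked Video Fine-Tuning can be sketched in a similarly hedged way. The summary only says that invalid regions are excluded from the loss; the mean-squared-error form, the function name `masked_mse`, and the `(T, H, W, C)` tensor layout below are assumptions chosen for illustration, not details confirmed by the source.

```python
import numpy as np


def masked_mse(pred, target, mask, eps=1e-8):
    """Mean squared error computed over valid pixels only.

    pred, target: (T, H, W, C) float arrays (hypothetical layout).
    mask: (T, H, W, 1) array, 1 where the anchor video has known
          pixels and 0 where it is empty or invalid.
    """
    # Broadcast the per-pixel mask across the channel axis.
    weights = np.broadcast_to(mask.astype(pred.dtype), pred.shape)
    squared_error = (pred - target) ** 2 * weights
    # Normalize by the number of valid entries so that the loss scale
    # does not depend on how much of the frame is masked out.
    return squared_error.sum() / (weights.sum() + eps)
```

With this form, errors inside masked-out regions contribute nothing to the loss, so no gradient would flow from them in a training setup that uses it.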

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

FaceSwapNet: Landmark Guided Many-to-Many Face Reenactment

ReVideo: Remake a Video with Motion and Content Control

SelfRecon: Self Reconstruction Your Digital Avatar from Monocular Video

Training-free Camera Control for Video Generation

Replace Anyone in Videos

Replay: Multi-modal Multi-view Acted Videos for Casual Holography

DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

TrackGo: A Flexible and Efficient Method for Controllable Video Generation

Controllable Free Viewpoint Video Reconstruction Based on Neural Radiance Fields and Motion Graphs

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

Neural Rendering and Reenactment of Human Actor Videos

Audio-driven Neural Gesture Reenactment with Video Motion Graphs

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

Facial Reenactment Through a Personalized Generator

Predicting Diverse Future Frames with Local Transformation-Guided Masking

View Synthesis of Dynamic Scenes based on Deep 3D Mask Volume

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers