ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E. Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, Nataniel Ruiz
2024-11-08
Abstract: Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that are not generated by a video model. In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. Our method allows us to re-generate the reference video, with all its existing scene motion, from vastly different angles and with cinematic camera motion. Notably, using our method we can also plausibly hallucinate parts of the scene that were not observable in the reference video. Our method works by (1) generating a noisy anchor video with a new camera trajectory using multiview diffusion models or depth-based point cloud rendering and then (2) regenerating the anchor video into a clean and temporally consistent reangled video using our proposed masked video fine-tuning technique.
Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics, Machine Learning
What problem does this paper attempt to address?
### Problem the Paper Attempts to Solve

This paper addresses the problem of generating new videos with novel camera trajectories from user-provided videos. Existing methods can generate videos with controllable camera trajectories, but they cannot be applied directly to user-provided videos that were not generated by a video model. The paper therefore proposes **ReCapture**, which generates new videos with novel camera trajectories from a single user-provided video. The method regenerates the reference video with all of its existing scene motion, while allowing the scene to be observed from different angles with cinematic camera movements. ReCapture can also plausibly infer parts of the scene that were not visible in the reference video.

### Main Contributions

1. **Generating New Camera Trajectories**: ReCapture generates new videos with novel camera trajectories from user-provided videos, preserving the complex scene motion of the original video.
2. **Plausible Inference of Unobserved Scene Parts**: By generating a noisy anchor video and then applying masked video fine-tuning, ReCapture can plausibly fill in parts of the scene that the reference video never observed.
3. **Two-Stage Method**:
   - **Stage 1**: Generate a noisy anchor video with a novel camera trajectory, using either multi-view diffusion models or depth-based point cloud rendering.
   - **Stage 2**: Use masked video fine-tuning to regenerate the anchor video into a clean, temporally consistent re-angled video. This stage trains a temporal motion LoRA and a spatial LoRA to correct errors and inconsistencies in the noisy anchor video and to fill in missing information.

### Technical Details

- **Point Cloud Sequence Rendering**: Convert each video frame into a 3D point cloud, then reproject the point cloud according to the new camera trajectory to render the new views.
- **Multi-View Image Diffusion**: Use a multi-view diffusion model to generate a new view for each frame. This option suits camera trajectories with large rotations and viewpoint changes.
- **Masked Video Fine-Tuning**: Produce the high-quality output video by training a context-aware spatial LoRA and a temporal motion LoRA on known pixels. The masked loss excludes invalid regions of the noisy anchor video, so the model learns only from meaningful pixels.

### Experimental Results

- **Quantitative Evaluation**: On the Kubric-4D dataset, ReCapture outperforms existing 4D reconstruction and generation methods on multiple automatic metrics (e.g., PSNR, SSIM, LPIPS).
- **User Study**: User studies further support ReCapture's advantage, particularly in subject consistency, background consistency, and temporal smoothness.
- **Ablation Study**: Ablation studies show the contribution of each component and support the effectiveness of the masked video fine-tuning technique.

In summary, ReCapture provides an effective way to add dynamic camera movements to user-provided videos and generate high-quality outputs without relying on large-scale 4D multi-view video data.
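
To make the depth-based point cloud rendering and the masked loss from the Technical Details more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera, a known per-frame depth map, and a plain pixel-space mean squared error standing in for the actual fine-tuning objective of the video model. The names `reproject_frame` and `masked_mse` are hypothetical.

```python
import numpy as np


def reproject_frame(rgb, depth, K, R, t):
    """Forward-warp one frame into a new camera; returns (image, valid_mask).

    rgb: (H, W, 3) floats, depth: (H, W) positive depths,
    K: (3, 3) pinhole intrinsics (assumed), R, t: source-to-target camera pose.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject every pixel to a 3D point in the source camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    # Move the point cloud into the target camera frame and project it.
    pts_t = R @ pts + t.reshape(3, 1)
    z = pts_t[2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero
    proj = K @ pts_t
    u_t = np.round(proj[0] / safe_z).astype(int)
    v_t = np.round(proj[1] / safe_z).astype(int)
    keep = in_front & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    uu, vv, zz = u_t[keep], v_t[keep], z[keep]
    colors = rgb.reshape(-1, 3)[keep]

    # Z-buffer: for each target pixel, keep only the nearest point.
    zbuf = np.full((H, W), np.inf)
    np.minimum.at(zbuf, (vv, uu), zz)
    nearest = zz <= zbuf[vv, uu]

    out = np.zeros_like(rgb)
    out[vv[nearest], uu[nearest]] = colors[nearest]
    mask = np.isfinite(zbuf)  # False where no point landed (holes)
    return out, mask


def masked_mse(pred, target, mask):
    """Mean squared error computed over valid (mask == True) pixels only."""
    m = mask[..., None].astype(pred.dtype)
    denom = max(float(m.sum()) * pred.shape[-1], 1.0)
    return float((((pred - target) ** 2) * m).sum() / denom)
```

Pixels where `mask` is false correspond to regions the source frame never observed. Excluding them from the loss means the model is never pushed to reproduce the holes in the anchor video, which is the intuition behind learning only from meaningful pixels.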