CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers

Andrew Marmon, Grant Schindler, José Lezama, Dan Kondratyuk, Bryan Seybold, Irfan Essa
2024-05-22
Abstract: We extend multimodal transformers to include 3D camera motion as a conditioning signal for the task of video generation. Generative video models are becoming increasingly powerful, thus focusing research efforts on methods of controlling the output of such models. We propose to add virtual 3D camera controls to generative video methods by conditioning generated video on an encoding of three-dimensional camera movement over the course of the generated video. Results demonstrate that we are (1) able to successfully control the camera during video generation, starting from a single frame and a camera signal, and (2) we demonstrate the accuracy of the generated 3D camera paths using traditional computer vision methods.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the problem of controlling 3D camera motion in video generation. Video generation models are becoming more powerful, but they often cannot control scene dynamics and camera movement independently of each other.

The proposed method treats 3D camera motion as a conditioning signal for a multimodal transformer. The researchers extend the transformer with input channels for 3D camera motion, so the camera is controlled in a non-textual form rather than only through text prompts such as "zoom" or "pan." Given a single image of a scene and a 3D camera path, the model generates a video that follows that path.

The main contribution is a method for generating video from a single scene image according to camera motion instructions. The method allows controlled changes to the 3D viewpoint of the scene while still permitting scene dynamics, and it automatically completes occluded regions and renders newly revealed areas. The researchers evaluate the camera control both quantitatively, by validating the accuracy of the 3D camera paths in the generated videos, and through qualitative comparisons.
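To make the idea of a non-textual camera signal concrete, here is a minimal sketch of one way a 3D camera path could be turned into discrete tokens that a transformer might consume alongside image tokens. This summary does not describe the paper's actual encoding, so everything below is an assumption for illustration: the function names `quantize_camera_path` and `dequantize_camera_tokens`, the translation-only path representation, the uniform binning, and the `[-1, 1]` range are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def quantize_camera_path(
    path: Sequence[Vec3],
    num_bins: int = 256,
    lo: float = -1.0,
    hi: float = 1.0,
) -> List[int]:
    """Map per-frame camera positions to integer tokens in [0, num_bins - 1].

    Hypothetical encoding: each coordinate is clipped to [lo, hi] and
    uniformly quantized. Tokens are emitted frame by frame as x, y, z.
    """
    tokens: List[int] = []
    for position in path:
        for value in position:
            clipped = min(max(value, lo), hi)
            scaled = (clipped - lo) / (hi - lo)  # now in [0, 1]
            tokens.append(round(scaled * (num_bins - 1)))
    return tokens


def dequantize_camera_tokens(
    tokens: Sequence[int],
    num_bins: int = 256,
    lo: float = -1.0,
    hi: float = 1.0,
) -> List[Vec3]:
    """Invert quantize_camera_path up to quantization error."""
    values = [lo + t / (num_bins - 1) * (hi - lo) for t in tokens]
    return [
        (values[i], values[i + 1], values[i + 2])
        for i in range(0, len(values), 3)
    ]


# Example: an 8-frame forward dolly along the z axis (assumed units).
frames = 8
path = [(0.0, 0.0, 0.5 * i / (frames - 1)) for i in range(frames)]
tokens = quantize_camera_path(path)
recovered = dequantize_camera_tokens(tokens)
```

In a setup like this, the token sequence would be supplied as an extra conditioning input next to the tokens of the first frame. A real system would also need to represent camera rotation, which this sketch omits.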