I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength

Wanquan Feng, Jiawei Liu, Pengqi Tu, Tianhao Qi, Mingzhen Sun, Tianxiang Ma, Songtao Zhao, Siyu Zhou, Qian He
2024-11-11
Abstract: Video generation technologies are developing rapidly and have broad potential applications. Among these technologies, camera control is crucial for generating professional-quality videos that accurately meet user expectations. However, existing camera control methods still suffer from several limitations, including control precision and the neglect of the control for subject motion dynamics. In this work, we propose I2VControl-Camera, a novel camera control method that significantly enhances controllability while providing adjustability over the strength of subject motion. To improve control precision, we employ point trajectory in the camera coordinate system instead of only extrinsic matrix information as our control signal. To accurately control and adjust the strength of subject motion, we explicitly model the higher-order components of the video trajectory expansion, not merely the linear terms, and design an operator that effectively represents the motion strength. We use an adapter architecture that is independent of the base model structure. Experiments on static and dynamic scenes show that our framework outperforms previous methods both quantitatively and qualitatively.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses two limitations of existing camera control methods for video generation: insufficient control precision and neglect of control over subject motion dynamics. It proposes I2VControl-Camera, a camera control method that aims to significantly improve controllability while letting users adjust the strength of subject motion. To improve control precision, the method uses point trajectories in the camera coordinate system as the control signal instead of relying solely on extrinsic matrix information. To control and adjust the strength of subject motion, it explicitly models the higher-order components of the video trajectory expansion, not just the linear terms, and designs an operator that represents the motion strength. The result is more precise control of camera motion together with user-adjustable subject motion, which helps produce professional-quality videos that better match user expectations.

### Main Contributions

1. **Explicitly modeled, decoupled motion representations**: 3D rigid point trajectories and motion strength, used for camera control and subject motion control respectively.
2. **A data pipeline for constructing the training control signals**: 3D tracking information and motion masks are registered from RGB videos.
3. **Better results than existing methods in both static and dynamic scenes**: the method performs better both quantitatively and qualitatively.

### Method Overview

- **Video Representation and Notation**: All point coordinates are defined in the camera coordinate system. The 3D world is divided into a static part and a dynamic part, where the static part corresponds to the linear motion in the camera coordinate system.
- **Control Signal Construction**: The point trajectory Tλ on the camera plane is defined by applying the linear motion to the 3D point region Ω captured in the first frame and projecting the result onto 2D. To overcome the problem of motion suppression in the nonlinear part, the paper further models the motion of the nonlinear part (the region that is dynamic in the world system) and quantifies the degree of motion dynamics at time λ by the first-order derivative with respect to time λ.
- **Data Pipeline**: The paper addresses several major gaps between real RGB video data and continuous trajectory functions: the lack of 3D information, the lack of temporal correspondence, and the lack of a dynamic/static partition. These gaps are filled with metric depth estimation and tracking methods, and the static and dynamic regions are extracted iteratively.
- **Network Structure, Training and Inference**: The paper adopts an adapter architecture, which keeps the method compatible with rapidly evolving base models. The network design allows the control features to be integrated into any diffusion process, so it can adapt to various video-generation base frameworks.

### Experimental Results

- **Visualization Results**: These show the method's pixel-level control and motion strength adjustment. When the motion strength is set to 0, the image content is almost stationary; as the motion strength increases, the main objects in the scene begin to move.
- **Comparative Experiments**: Compared with previous baseline methods (such as MotionCtrl and CameraCtrl) under the same experimental settings, the proposed method performs well in both control precision and motion strength adjustment.

In conclusion, by introducing new control signals and modeling methods, this paper significantly improves the precision and flexibility of camera control in video generation, especially for dynamic scenes.
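
As a rough illustration of the control-signal construction described under Method Overview, the sketch below lifts a first-frame pixel to 3D using a depth value, moves it with a per-frame rigid camera transform, and projects it back to 2D to obtain a point trajectory. It then computes a simple stand-in for motion strength: the mean finite-difference speed of the residual between a tracked trajectory and the rigid one. Everything here is an assumption made for illustration and not the paper's implementation: the pinhole intrinsics tuple `K`, the `(R, t)` pose convention, and all function names (`unproject`, `transform`, `project`, `rigid_trajectory`, `motion_strength_proxy`) are hypothetical, and the paper's actual motion-strength operator is not specified in this summary.

```python
import math

def unproject(pixel, depth, K):
    """Lift a pixel with metric depth to a 3D point in the first-frame camera frame."""
    (u, v), (fx, fy, cx, cy) = pixel, K
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def transform(point, R, t):
    """Apply x -> R x + t, with R a 3x3 nested list and t a length-3 sequence."""
    return tuple(sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))

def project(point, K):
    """Pinhole projection of a camera-frame 3D point (z > 0) to pixel coordinates."""
    (x, y, z), (fx, fy, cx, cy) = point, K
    return (fx * x / z + cx, fy * y / z + cy)

def rigid_trajectory(pixel, depth, poses, K):
    """2D trajectory of a static scene point under camera motion.

    Assumption: each pose (R, t) maps first-frame camera coordinates
    to the camera coordinates of one later frame.
    """
    p0 = unproject(pixel, depth, K)
    return [project(transform(p0, R, t), K) for R, t in poses]

def motion_strength_proxy(tracked, rigid):
    """Mean frame-to-frame speed of the non-rigid residual (illustrative proxy only)."""
    if len(tracked) != len(rigid) or len(tracked) < 2:
        raise ValueError("need two trajectories of equal length with >= 2 frames")
    # Residual motion: what remains after removing the camera-induced part.
    residual = [(a[0] - b[0], a[1] - b[1]) for a, b in zip(tracked, rigid)]
    # First-order finite difference of the residual over time.
    steps = [math.dist(residual[k + 1], residual[k]) for k in range(len(residual) - 1)]
    return sum(steps) / len(steps)

# Example: two frames, the second with the scene shifted 0.2 units along -x
# in camera coordinates (hypothetical values).
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy
poses = [(IDENTITY, (0.0, 0.0, 0.0)), (IDENTITY, (-0.2, 0.0, 0.0))]
static_track = rigid_trajectory((60.0, 50.0), 2.0, poses, K)
```

In this sketch a point that follows `static_track` exactly has zero residual, so the proxy is zero; any extra displacement of a tracked point relative to the rigid trajectory raises it. This loosely mirrors the summary's split between a camera-induced linear part and a subject-driven nonlinear part.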