Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava
2024-08-28
Abstract: Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot video motion customization, we propose Customize-A-Video that models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for specific motion modeling. To disentangle the spatial and temporal information during training, we introduce a novel concept of appearance absorbers that detach the original appearance from the reference video prior to motion learning. The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extensions to various downstream tasks such as custom video generation and editing, video appearance customization and multiple motion combination. Our project page can be found at https://customize-a-video.github.io.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of **one-shot video motion customization**. Specifically:

1. **Background and Challenges**:
   - Current text-to-video (T2V) generation models can generate imaginative videos from textual descriptions, but they struggle with precise motion control and often require complex prompt engineering.
   - Existing video editing methods can achieve motion transfer to some extent, but they adhere strictly to the structure and layout of the reference frames and therefore cannot provide diversity in the motion itself.
2. **Objectives**:
   - Propose a new method called **Customize-A-Video**, built on a pre-trained T2V diffusion model. The method learns motion features from a single reference video and customizes that motion for new subjects and scenes.
   - Beyond transferring the motion accurately, the method introduces variations in motion intensity, position, number of subjects, and camera angle, making the output video more vivid and interesting.
3. **Main Contributions**:
   - Introduces a new one-shot video motion customization task and presents a one-shot method, Customize-A-Video, based on a pre-trained T2V diffusion model.
   - Introduces **Temporal LoRA** to learn motion from a single reference video, achieving both accurate motion transfer and diverse variations.
   - Proposes **Appearance Absorbers**, modules designed to separate the spatial (appearance) information of the reference video from its motion, so that the reference appearance does not influence motion customization.
   - The proposed modules are plug-and-play and can be easily extended to various downstream applications.

Through this research, the paper aims to advance the field of video creation, making it easier for users to create high-quality videos that meet specific needs.
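
The low-rank adaptation behind Temporal LoRA can be illustrated with a generic adapter around a single frozen linear projection. This is a minimal NumPy sketch of the standard LoRA formulation, not the authors' implementation: the class name `LoRALinear`, the `enabled` switch, the rank, the scaling, and the layer sizes are all assumptions made for illustration, and the summary does not say which projections inside the temporal attention layers are adapted.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer W plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration of generic LoRA; not the paper's code.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                    # trainable up-projection, zero init
        self.scale = alpha / rank
        self.enabled = True                                   # assumed plug-and-play switch

    def __call__(self, x):
        y = x @ self.weight.T
        if self.enabled:
            # Low-rank residual: project down to `rank` dims, then back up.
            y = y + self.scale * (x @ self.A.T) @ self.B.T
        return y

    def num_trainable(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))        # stand-in for one attention projection
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(8, 64))         # e.g. 8 frame tokens (hypothetical shape)
base = x @ W.T

# With B initialized to zero, the adapted layer reproduces the base model.
same_at_init = np.allclose(layer(x), base)

# Stand-in for training: any non-zero B changes the output.
layer.B = rng.normal(size=layer.B.shape)
changed = not np.allclose(layer(x), base)

# Unplugging the adapter restores the pre-trained behavior exactly.
layer.enabled = False
restored = np.allclose(layer(x), base)
```

In this toy setting the adapter trains 512 parameters against 4096 frozen ones, which is the property that makes learning from a single reference video plausible. Under the same assumptions, an appearance absorber could be a second adapter of this kind attached to spatial layers, switched on during motion training and switched off at inference.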