Abstract: Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot video motion customization, we propose Customize-A-Video that models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for specific motion modeling. To disentangle the spatial and temporal information during training, we introduce a novel concept of appearance absorbers that detach the original appearance from the reference video prior to motion learning. The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extensions to various downstream tasks such as custom video generation and editing, video appearance customization and multiple motion combination. Our project page can be found at https://customize-a-video.github.io.

### What problem does this paper attempt to solve?

This paper aims to address the problem of **one-shot video motion customization**. Specifically:

1. **Background and Challenges**:
   - Current text-to-video generation models (T2V) can generate imaginative videos based on textual descriptions, but they struggle with precise motion control and often require complex prompt engineering.
   - Existing video editing methods can achieve motion transfer to some extent, but these methods strictly adhere to the structure and layout of the reference frames, failing to provide diversity in the motion itself.
2. **Objectives**:
   - Propose a new method called **Customize-A-Video**, which is based on a pre-trained T2V diffusion model. This method learns motion features from a single reference video to achieve motion customization for new subjects and scenes.
   - This method not only accurately transfers motion but also introduces variations in motion intensity, position, number of subjects, and camera angles, making the output video more vivid and interesting.
3. **Main Contributions**:
   - Introduced a new one-shot video motion customization task and presented a one-shot method called Customize-A-Video based on a pre-trained T2V diffusion model.
   - Introduced **Temporal LoRA** to learn motion from a single reference video, achieving not only accurate motion transfer but also diverse variations.
   - Proposed the **Appearance Absorbers** module, specifically designed to decompose spatial information from the reference video, effectively eliminating its influence on the motion customization process.
   - The proposed modules are plug-and-play and can be easily extended to various downstream applications.

Through this research, the paper aims to advance the field of video creation, making it easier for users to create high-quality videos that meet specific needs.
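The Temporal LoRA idea summarized above rests on generic low-rank adaptation: a frozen pre-trained weight `W` is augmented with a trainable update `(alpha / r) * B @ A` of rank `r`. The following is a minimal sketch of that general mechanism in plain Python, not the authors' implementation. The class name `LoRALinear`, the `enabled` switch (standing in for plug-and-play loading at inference), and the toy matrices are all hypothetical; in the paper the adapted weights would be those of the temporal attention layers, which are not modeled here.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical toy layer: frozen W (out x in) plus low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pre-trained weight, never modified
        self.scale = alpha / rank
        # Common LoRA initialization: A small random, B zero, so the update starts at zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.enabled = True           # assumption: a simple on/off switch for the adapter

    def effective_weight(self):
        """Return W when the adapter is off, else W + scale * (B @ A)."""
        if not self.enabled:
            return self.weight
        delta = matmul(self.B, self.A)  # (out x r) @ (r x in) -> (out x in)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """x is a list of row vectors (tokens x in_dim); returns tokens x out_dim."""
        w_t = [list(col) for col in zip(*self.effective_weight())]  # transpose to in x out
        return matmul(x, w_t)


# Toy usage: a 2x3 frozen weight and a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
layer = LoRALinear(W, rank=1)
x = [[1.0, 2.0, 3.0]]

before = layer.forward(x)      # B is zero, so this matches the frozen layer

# Pretend training produced these adapter values (made up for illustration).
layer.B = [[1.0], [0.0]]
layer.A = [[0.0, 0.0, 0.5]]
adapted = layer.forward(x)

layer.enabled = False          # unplugging the adapter restores the base behavior
restored = layer.forward(x)
```

Because only `A` and `B` would be trained, the base model stays intact and the adapter can be attached or detached freely, which is the property the summary describes as plug-and-play.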

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

MotionCrafter: One-Shot Motion Customization of Diffusion Models

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Still-Moving: Customized Video Generation without Customized Video Data

NewMove: Customizing text-to-video models with novel motions

Motion Inversion for Video Customization

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

MotionBooth: Motion-Aware Customized Text-to-Video Generation

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

DreamVideo: Composing Your Dream Videos with Customized Subject and Motion

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

SAVE: Protagonist Diversification with Structure Agnostic Video Editing

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models