AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, Bo Dai
2024-02-09
Abstract: With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be trained once and seamlessly integrated into any personalized T2Is originating from the same base T2I. Through our proposed training strategy, the motion module effectively learns transferable motion priors from real-world videos. Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator. We further propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns, such as different shot types, at a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity. Code and pre-trained weights are available at https://github.com/guoyww/AnimateDiff.
Computer Vision and Pattern Recognition, Graphics, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses how to turn existing high-quality personalized text-to-image (T2I) models directly into animation generators without fine-tuning each model individually. Specifically:

1. **Problem Background**: With the development of text-to-image diffusion models (such as Stable Diffusion) and the advancement of personalization techniques (such as DreamBooth and LoRA), users can easily create high-quality images on consumer-grade hardware (such as a laptop with an RTX 3080 graphics card). However, extending these high-quality personalized T2I models to animation generation remains a challenge.
2. **Objective**: Propose a method that converts existing personalized T2I models into animation generators without model-specific fine-tuning, since such fine-tuning is often impractical for amateur users in terms of computational cost and data collection.

### Main Contributions

1. **AnimateDiff Framework**: Propose a practical pipeline that turns any personalized T2I model derived from the same base T2I into an animation generator without model-specific fine-tuning.
2. **Transformer Architecture**: Validate that the Transformer architecture is sufficient for modeling motion priors, providing valuable insights for video generation.
3. **MotionLoRA Technique**: Propose a lightweight fine-tuning technique that allows a pre-trained motion module to adapt to new motion patterns, such as different shot types, using only a small number of reference videos and training iterations.
4. **Comprehensive Evaluation**: Evaluate the approach on representative personalized T2I models collected from the community and compare it with academic baselines and commercial tools.

Through these contributions, the paper demonstrates how to effectively convert existing personalized T2I models into animation generators while maintaining visual quality and motion diversity.
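
To make the "plug-and-play motion module" idea concrete, the following is a minimal pure-Python sketch of temporal self-attention applied along the frame axis only. It is not the authors' implementation: the function names, the single-head attention, the tiny feature sizes, and the omission of positional encoding are all illustrative assumptions. The zero-initialized output projection combined with a residual connection is also an assumption of this sketch; it is included to show one way an inserted module can start as an identity mapping, so the image model's per-frame output is unchanged until the module is trained.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(mat, vec):
    # Multiply a (d x d) matrix, stored as a list of rows, by a length-d vector.
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]


def temporal_self_attention(frames, w_q, w_k, w_v, w_out):
    # frames: one feature vector per frame, all taken from the SAME spatial location.
    d = len(frames[0])
    q = [matvec(w_q, x) for x in frames]
    k = [matvec(w_k, x) for x in frames]
    v = [matvec(w_v, x) for x in frames]
    out = []
    for i, x in enumerate(frames):
        # Each frame attends to every frame of the clip.
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        mixed = [sum(wt * vj[c] for wt, vj in zip(weights, v)) for c in range(d)]
        projected = matvec(w_out, mixed)
        # Residual connection: the module adds a correction to the image features.
        out.append([xi + pi for xi, pi in zip(x, projected)])
    return out


def motion_module(video, w_q, w_k, w_v, w_out):
    # video[t][y][x] is a feature vector; attention mixes t only, never y or x.
    f, h, w = len(video), len(video[0]), len(video[0][0])
    result = [[[None] * w for _ in range(h)] for _ in range(f)]
    for y in range(h):
        for x in range(w):
            seq = [video[t][y][x] for t in range(f)]
            new_seq = temporal_self_attention(seq, w_q, w_k, w_v, w_out)
            for t in range(f):
                result[t][y][x] = new_seq[t]
    return result


def random_matrix(d, rng):
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]


def zero_matrix(d):
    return [[0.0] * d for _ in range(d)]


# Hypothetical toy sizes: 3 frames, a 2x2 feature map, 4 channels.
rng = random.Random(0)
f, h, w, d = 3, 2, 2, 4
video = [[[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(w)]
          for _ in range(h)] for _ in range(f)]
w_q, w_k, w_v = (random_matrix(d, rng) for _ in range(3))
w_out = random_matrix(d, rng)

untrained = motion_module(video, w_q, w_k, w_v, zero_matrix(d))  # zero-init output
trained = motion_module(video, w_q, w_k, w_v, w_out)             # stand-in for learned weights
print("identity at initialization:", untrained == video)
```

Because the module only mixes information across frames at a fixed spatial location, it leaves the spatial layers of the underlying T2I model untouched, which is consistent with the summary's claim that one trained module can be reused across personalized models that share a base T2I.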