Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang
2024-08-19
Abstract: Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages: Firstly, the Anim-Director generates a coherent storyline from user inputs, followed by a detailed director's script that encompasses settings of character profiles and interior/exterior descriptions, and context-coherent scene descriptions that include appearing characters, interiors or exteriors, and scene events. Secondly, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions and images of the appearing character and setting. Thirdly, scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous without manual intervention, as the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best one to optimize the final output.
Computation and Language, Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
The paper addresses the limitations of traditional animation generation methods: their reliance on manually annotated training data, their complex multi-stage pipelines that require significant human effort, and the short, information-poor, and contextually inconsistent videos they typically produce. To overcome these limitations and automate animation production, the authors propose Anim-Director, an autonomous agent built on large multimodal models (LMMs) that generates coherent animation videos from brief narratives or simple instructions. Specifically, Anim-Director works through the following steps:

1. **Story Optimization**: Starting from a brief story or narrative supplied by the user, LMMs expand it into a detailed, coherent, and plot-rich story, for example by introducing character dialogue and story details while retaining character names.
2. **Script Generation**: Based on the expanded story, a detailed director's script is generated, including character settings, scene descriptions, and shot divisions.
3. **Scene Image Generation**: Image generation tools (such as Midjourney) produce scene images from the text descriptions while keeping the visuals consistent across scenes.
4. **Scene Image Improvement**: A self-reflection mechanism evaluates the quality of the generated images and selects those that best match the scene descriptions. The selected images are then refined using image segmentation tools (such as SAM) and image region replacement functions.
5. **Video Generation**: Animation videos are generated from the scene images and text prompts, with the agent predicting suitable parameter settings for the video generation tools so that the output captures the dynamics and visual content of each scene.
6. **Video Quality Enhancement**: The visual quality and contextual coherence of the videos are assessed by detecting distortions and evaluating the consistency between subject and background, and the best video is selected from multiple candidates.

Through these steps, Anim-Director simplifies the animation production process and aims to improve the quality and coherence of the generated animations without human intervention. This makes animation production more efficient and automated, and reduces reliance on large studio resources.
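The generate, evaluate, and select pattern that recurs in the steps above (several candidate images or videos per scene, each scored against the scene description, with the best one kept) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Scene`, `select_best`, `run_pipeline`, and the callables passed in) is hypothetical, and the LMM, image tool, and video tool are represented only as caller-supplied functions rather than real API calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_best(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the highest-scoring candidate; ties go to the earliest one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


@dataclass
class Scene:
    """One scene of the director's script (hypothetical structure)."""
    description: str
    image: str = ""   # identifier of the selected scene image
    video: str = ""   # identifier of the selected scene video


def run_pipeline(
    narrative: str,
    expand_story: Callable[[str], str],             # stand-in for LMM story optimization
    write_script: Callable[[str], List[str]],       # stand-in for LMM script generation
    gen_images: Callable[[str], List[str]],         # stand-in for an image generation tool
    score_image: Callable[[str, str], float],       # stand-in for LMM self-reflection on images
    gen_videos: Callable[[str, str], List[str]],    # stand-in for a video generation tool
    score_video: Callable[[str, str], float],       # stand-in for LMM video quality assessment
) -> List[Scene]:
    """Expand the narrative, script it, then pick the best image and video per scene."""
    story = expand_story(narrative)
    scenes = [Scene(description=d) for d in write_script(story)]
    for scene in scenes:
        # Generate several image candidates and keep the one that best matches the scene.
        scene.image = select_best(
            gen_images(scene.description),
            lambda img: score_image(img, scene.description),
        )
        # Animate the chosen image and keep the best-scoring video candidate.
        scene.video = select_best(
            gen_videos(scene.image, scene.description),
            lambda vid: score_video(vid, scene.description),
        )
    return scenes
```

Passing the generators and scorers as plain callables keeps the control flow separate from any particular tool, so the same loop could be exercised with stub functions before wiring in a real LMM or generation service. The region-level refinement in step 4 (segmentation and region replacement) is deliberately left out of this sketch.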