AniClipart: Clipart Animation with Text-to-Video Priors

Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao
2024-04-19
Abstract: Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
The paper addresses how to turn static clipart into high-quality animation sequences, guided by a text prompt, while preserving the clipart's visual identity and frame-to-frame consistency. To solve this problem, it introduces AniClipart, a system that uses a pretrained text-to-video diffusion model to guide the animation of a static clipart image. This targets the tendency of existing models, when applied directly, to lose the clipart's visual identity or to fail to produce cartoon-style motion. The method has three parts:
1. Defining Bézier curves over keypoints of the clipart image, so that each keypoint follows a smooth motion path; this acts as a form of motion regularization.
2. Aligning the keypoint motion trajectories with the text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which draws on the knowledge of natural motion encoded in the pretrained text-to-video diffusion model.
3. Using a differentiable As-Rigid-As-Possible (ARAP) shape deformation algorithm, so that the whole pipeline can be optimized end to end: keypoint positions are updated while deformation stays rigid, which avoids pixel-level distortion and preserves visual identity.
Experimental results show that AniClipart outperforms existing image-to-video generation models in text-video alignment, visual identity preservation, and motion consistency. The method can also be adapted to other animation formats, such as layered animation, which allows topological changes.
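The first step, a Bézier curve as a keypoint's motion path, can be illustrated with a minimal sketch. It assumes a cubic curve with four 2D control points and uniformly spaced frames; the summary above does not state the curve degree or the sampling scheme, and the names `cubic_bezier` and `keypoint_trajectory` are hypothetical, not taken from the paper. In the actual system the control points would be the variables optimized through the VSDS loss; here they are fixed by hand.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    s = 1.0 - t
    # Bernstein basis weights for a cubic curve; they always sum to 1.
    b0, b1, b2, b3 = s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def keypoint_trajectory(
    control_points: Tuple[Point, Point, Point, Point], num_frames: int
) -> List[Point]:
    """Sample one keypoint position per frame along the curve.

    Assumption: frames are spaced uniformly in the curve parameter t.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    p0, p1, p2, p3 = control_points
    return [
        cubic_bezier(p0, p1, p2, p3, i / (num_frames - 1))
        for i in range(num_frames)
    ]


# Hand-picked control points: the keypoint starts at (0, 0) and ends at (4, 0).
trajectory = keypoint_trajectory(((0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)), 5)
for frame, (x, y) in enumerate(trajectory):
    print(f"frame {frame}: ({x:.3f}, {y:.3f})")
```

Because every frame position is a smooth function of only four control points, the motion stays smooth however the control points are moved, which is the regularizing effect described in step 1.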