Abstract: Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.

What problem does this paper attempt to address?

The paper addresses how to turn a static clipart image into a high-quality animation that follows a text prompt while preserving the clipart's visual identity and frame-to-frame consistency. Applying text-to-video generation models directly tends to lose the clipart's appearance and fails to produce cartoon-style motion. To solve this, the paper introduces AniClipart, a system that uses a pretrained text-to-video diffusion model as a motion prior rather than as a direct frame generator. The method has three parts:
1. Define Bézier curves over keypoints of the clipart image. The curves act as smooth motion paths and regularize the motion.
2. Align the keypoint trajectories with the text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which distills knowledge of natural motion from the pretrained text-to-video diffusion model.
3. Deform the clipart with a differentiable As-Rigid-As-Possible (ARAP) shape deformation algorithm driven by the keypoint positions. This keeps the pipeline end-to-end optimizable and maintains deformation rigidity, which helps preserve visual identity.
Experimental results show that AniClipart outperforms existing image-to-video generation models in text-video alignment, visual identity preservation, and motion consistency. The system can also be adapted to other animation formats, such as layered animation, which allows topological changes.
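The first step, moving keypoints along Bézier curves, can be illustrated with a minimal Python sketch. It assumes cubic curves (four control points per keypoint) and uniform sampling over frames; the abstract only says "Bézier curves", so the degree, the function names, and the example coordinates are all hypothetical. In the actual system the control points would be optimized through the VSDS loss and the keypoints would drive the ARAP deformation, neither of which is shown here.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    # Bernstein basis weights for a cubic curve.
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def sample_trajectory(control_points: Sequence[Point], num_frames: int) -> List[Point]:
    """Sample one keypoint position per frame along a cubic Bezier curve."""
    if len(control_points) != 4:
        raise ValueError("a cubic Bezier curve needs exactly 4 control points")
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    p0, p1, p2, p3 = control_points
    return [
        cubic_bezier(p0, p1, p2, p3, i / (num_frames - 1))
        for i in range(num_frames)
    ]


def keypoint_trajectories(
    curves: Dict[str, Sequence[Point]], num_frames: int
) -> Dict[str, List[Point]]:
    """Build a per-frame trajectory for every named keypoint."""
    return {name: sample_trajectory(cps, num_frames) for name, cps in curves.items()}


# Hypothetical keypoints: each starts at its first control point (the position
# in the static clipart) and moves along its own curve.
curves = {
    "hand": [(0.0, 0.0), (1.0, 2.0), (2.0, 2.0), (3.0, 0.0)],
    "foot": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
}
trajectories = keypoint_trajectories(curves, num_frames=5)
```

Because each trajectory is determined by only a few control points, the motion stays smooth by construction, which is one way to read the abstract's description of the curves as "a form of motion regularization".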

AniClipart: Clipart Animation with Text-to-Video Priors

Reuse of Clips in Cartoon Animation Based on Language Instructions

Real-time Cartoon Water Animation

Text-Animator: Controllable Visual Text Video Generation

Performance-Driven Animation of Hand-Drawn Cartoon Faces

ClipFlip: Multi-view Clipart Design

Temporally Coherent Video Cartoonization for Animation Scenery Generation

AnimateAnything: Consistent and Controllable Animation for Video Generation

Enhancing Sketch Animation: Text-to-Video Diffusion Models with Temporal Consistency and Rigidity Constraints

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Breathing Life Into Sketches Using Text-to-Video Priors

Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text

MotionBooth: Motion-Aware Customized Text-to-Video Generation

Wakey-Wakey: Animate Text by Mimicking Characters in a GIF

Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance

Unsupervised Coherent Video Cartoonization with Perceptual Motion Consistency

SAVE: Protagonist Diversification with Structure Agnostic Video Editing

Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

Template Fitting Non Parametric Sampling Stroke Library Stroke Rendering Interactive Hair Contour Extraction Interactive Face Alignment

Deep Sketch-guided Cartoon Video Inbetweening