Abstract: Image animation animates a still image of an object of interest using poses extracted from another video sequence. By training on a large-scale video dataset, most existing approaches aim to learn disentangled appearance and pose representations of the training frames. The desired output with a specific appearance and pose can then be synthesized by recombining the learned representations. However, in some real-world applications, test images may lack corresponding video ground truth or follow a different distribution from that of the training video frames (i.e., they come from a different domain), which largely limits the performance of existing methods. In this paper, we propose domain-independent pose representations that are compatible with and accessible by still images from a different domain. Specifically, we devise a two-stage self-supervised pose adaptation framework for general image animation tasks. A domain-independent pose adaptation generative adversarial network (DIPA-GAN) and a shuffle-patch generative adversarial network (Shuffle-patch GAN) are proposed to penalize implausible pose and appearance, respectively, in the synthesized frame. Finally, experiments on various image animation tasks, including same/cross-domain moving objects, facial expression transfer, and human pose retargeting, demonstrate the superiority of the proposed framework over prior work.

Impact Statement: Image animation is a popular technology in video production. Benefiting from the rapid development of artificial intelligence (AI), recent image animation algorithms have been widely used in real-world applications, such as virtual AI news anchors, virtual try-on, and face swapping. However, most existing methods are designed for specific cases. To animate a new portrait, users must collect hundreds of images of the same person and train a new model. The technology proposed in this paper overcomes these training limitations and generalizes image animation. In the challenging cross-domain facial expression transfer task, a user study showed that our technology achieved an increase of more than 20% in animation success rate. The proposed technology could benefit users in a wide variety of industries, including movie production, virtual reality, social media, and online retail.

Synthesizing Videos from Images for Image-to-Video Adaptation

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Ivs-Net: Learning Human View Synthesis from Internet Videos

Rethinking Image-to-Video Adaptation: An Object-centric Perspective

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models

ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

Probabilistic Adaptation of Text-to-Video Models

Adaptive Compact Attention For Few-shot Video-to-video Translation

Dynamic and Compressive Adaptation of Transformers From Images to Videos

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Image to Video Domain Adaptation Using Web Supervision

Consistent Video-to-Video Transfer Using Synthetic Dataset

Memory Efficient Temporal & Visual Graph Model for Unsupervised Video Domain Adaptation

Adaptive Image-to-Video Scene Graph Generation via Knowledge Reasoning and Adversarial Learning

Translation-based Video-to-Video Synthesis

Self-Supervised Pose Adaptation for Cross-Domain Image Animation

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

I4VGen: Image as Free Stepping Stone for Text-to-Video Generation