Abstract: Existing image-to-video translation methods generally follow a frame-by-frame generative paradigm, extracting temporal information from a reference video or an audio stream. Inspired by the recent success of text-guided image generation, we explore a more challenging but promising task, Text-guided Image-to-Video (TI2V) translation. Given an image and a brief text description as input, TI2V aims to generate a facial expression video that follows both the image and the text. To this end, we first propose an automatic video captioning pipeline that generates dense textual descriptions for facial video datasets from both expression labels and action units. These dense descriptions provide precise semantic guidance for TI2V learning. We then design and train an efficient framework, FaceCLIP, on these datasets to address the TI2V translation task. FaceCLIP adopts a video autoencoder to model the temporal information of training videos, and a pretrained CLIP model to embed the video frames and the text description. We design a reconstruction loss and an embedding alignment loss that train the autoencoder to generate video under text guidance. Because expressions are closely tied to facial landmark motions, the reconstruction loss is applied to facial landmarks rather than to each video frame, which significantly improves training efficiency. We compare FaceCLIP with several potential baseline methods and extensively evaluate its performance using multiple metrics. Qualitative and quantitative results show that FaceCLIP outperforms these baselines in both visual quality and expression-text consistency. Moreover, FaceCLIP's unique ability to generate videos from abstract texts demonstrates its stronger generalization capability.
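
The abstract names two training terms, a reconstruction loss on facial landmarks and an embedding alignment loss between frame and text embeddings, without giving their exact form. The following is only a minimal sketch under stated assumptions: the reconstruction term is taken to be a mean squared distance over 2D landmarks, the alignment term is taken to be one minus the mean cosine similarity between each frame embedding and the text embedding, and the function names and the `align_weight` parameter are hypothetical. The paper's actual formulation may differ.

```python
import math


def landmark_reconstruction_loss(pred, target):
    """Mean squared Euclidean distance between predicted and target landmarks.

    pred, target: list of frames, each frame a list of (x, y) landmark points.
    Assumption: the abstract only says the loss is applied to landmarks,
    not which distance is used.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for (px, py), (tx, ty) in zip(pred_frame, target_frame):
            total += (px - tx) ** 2 + (py - ty) ** 2
            count += 1
    return total / count


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_alignment_loss(frame_embeddings, text_embedding):
    """One minus the mean cosine similarity of frame embeddings to the text.

    Assumption: the embeddings would come from a pretrained CLIP model;
    here they are plain lists of floats.
    """
    sims = [cosine_similarity(e, text_embedding) for e in frame_embeddings]
    return 1.0 - sum(sims) / len(sims)


def total_loss(pred_lm, target_lm, frame_emb, text_emb, align_weight=1.0):
    """Weighted sum of the two terms; align_weight is a hypothetical knob."""
    recon = landmark_reconstruction_loss(pred_lm, target_lm)
    align = embedding_alignment_loss(frame_emb, text_emb)
    return recon + align_weight * align
```

In this sketch the reconstruction term touches only a handful of landmark coordinates per frame instead of every pixel, which is one plausible reading of why a landmark-level loss would be cheaper to train with than a frame-level one.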

CRFAST: Clip-Based Reference-Guided Facial Image Semantic Transfer

Expression Conditional GAN for Facial Expression-to-Expression Translation.

FaceSwapNet: Landmark Guided Many-to-Many Face Reenactment

FaceCLIP: Facial Image-to-Video Translation Via A Brief Text Description

FacialGAN: Style Transfer and Attribute Manipulation on Synthetic Faces

Unconstrained Facial Expression Transfer using Style-based Generator

StyleIPSB: Identity-Preserving Semantic Basis of StyleGAN for High Fidelity Face Swapping

ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

Facial Expression Transfer Based on Conditional Generative Adversarial Networks.

Sem-CS: Semantic CLIPStyler for Text-Based Image Style Transfer

FrseGAN: Free‐style editable facial makeup transfer based on GAN combined with transformer

CLIPstyler: Image Style Transfer with a Single Text Condition

Foreground and background separated image style transfer with a single text condition

Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

Semantic prior guided fine-grained facial expression manipulation

Deep Realistic Facial Editing via Label-restricted Mask Disentanglement

ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation

Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation