Abstract: The fundamental challenge in video generation is not only generating high-quality image sequences but also generating consistent frames with no abrupt shifts. With the development of generative adversarial networks (GANs), great progress has been made in image generation tasks, which can be used for facial expression synthesis. Most previous works focused on synthesizing frontal and near-frontal faces and relied on manual annotation. However, considering only the frontal and near-frontal area is not sufficient for many real-world applications, and manual annotation fails when the video is incomplete. AffineGAN, a recent study, uses an affine transformation in latent space to automatically infer the expression intensity value; however, it requires extracting features from the target ground truth image, and the image sequences it generates are also not sufficient. To address these issues, this study proposes inferring the expression intensity value automatically, without the need to extract features from the ground truth images. A local dataset was prepared with frontal faces and with two additional face positions (the left and right sides). Average content distance (ACD) metrics were measured for the proposed solution across different experiments, and the proposed solution showed improvements. Relative to AffineGAN, the proposed method improved ACD-I from 1.606 ± 0.018 to 1.584 ± 0.00, ACD-C from 1.452 ± 0.008 to 1.430 ± 0.009, and ACD-G from 1.769 ± 0.007 to 1.744 ± 0.01, a consistent improvement over AffineGAN on all three metrics. This work concludes that integrating self-attention into the generator network improves the quality of the generated image sequences. In addition, assigning expression intensity values by distributing them evenly according to the number of frames improves the consistency of the generated image sequences. It also enables the generator to produce videos with different numbers of frames while the intensity values remain within the range [0, 1].
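The even distribution of expression intensity values over the frames can be illustrated with a minimal sketch. The function name `intensity_schedule`, the choice to include both endpoints 0 and 1, and the requirement of at least two frames are assumptions made for illustration only; the abstract does not specify how the values are computed in the actual implementation.

```python
def intensity_schedule(num_frames: int) -> list[float]:
    """Return one expression intensity per frame, evenly spaced in [0, 1].

    Hypothetical helper: the first frame gets 0.0 (neutral face) and the
    last frame gets 1.0 (full expression). Including both endpoints is an
    assumption, not something stated in the abstract.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    # Dividing the frame index by (num_frames - 1) keeps every value in [0, 1]
    # regardless of how many frames are requested.
    return [i / (num_frames - 1) for i in range(num_frames)]


if __name__ == "__main__":
    for n in (5, 8):
        print(n, intensity_schedule(n))
```

Under these assumptions, changing `num_frames` changes only the spacing between consecutive intensities, never the range, which is consistent with the claim that videos of different lengths can be generated while staying within [0, 1].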
