Abstract: Generative models have surged in popularity recently due to their ability to produce high-quality images and video. However, steering these models to produce images with specific attributes and precise control remains challenging. Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent. Current generative models mostly generate flat neutral expressions and characterless smiles without authenticity. Other basic expressions like anger are possible, but are limited to the stereotypical expression, while other unconventional facial expressions like doubtful are difficult to reliably generate. In this work, we propose the use of AUs (action units) for facial expression control in face generation. AUs describe individual facial muscle movements based on facial anatomy, allowing precise and localized control over the intensity of facial movements. By combining different action units, we unlock the ability to create unconventional facial expressions that go beyond typical emotional models, enabling nuanced and authentic reactions reflective of real-world expressions. The proposed method can be seamlessly integrated with both text and image prompts using adapters, offering precise and intuitive control of the generated results. Code and dataset are available at https://github.com/tvaranka/fineface.
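
To make the idea of combining action units concrete, the following minimal Python sketch represents an expression as a fixed-order vector of AU intensities. The AU subset, the 0 to 5 intensity range, and the names `AU_NAMES`, `MAX_INTENSITY`, and `au_vector` are assumptions chosen for illustration; they are not taken from the paper or its code release. The AU names themselves follow the Facial Action Coding System (FACS).

```python
# Hypothetical AU subset; the names follow FACS, the selection is an assumption.
AU_NAMES = {
    1: "inner brow raiser",
    2: "outer brow raiser",
    4: "brow lowerer",
    5: "upper lid raiser",
    6: "cheek raiser",
    9: "nose wrinkler",
    12: "lip corner puller",
    15: "lip corner depressor",
    17: "chin raiser",
    20: "lip stretcher",
    25: "lips part",
    26: "jaw drop",
}
AU_ORDER = sorted(AU_NAMES)   # fixed ordering of the vector entries
MAX_INTENSITY = 5.0           # assumed intensity scale: 0 (absent) to 5 (maximum)


def au_vector(intensities):
    """Convert {au_id: intensity} into a fixed-order list of floats.

    AUs that are not mentioned default to 0.0 (muscle movement absent).
    """
    unknown = set(intensities) - set(AU_NAMES)
    if unknown:
        raise ValueError(f"unsupported AUs: {sorted(unknown)}")
    vector = []
    for au in AU_ORDER:
        value = float(intensities.get(au, 0.0))
        if not 0.0 <= value <= MAX_INTENSITY:
            raise ValueError(f"AU{au} intensity {value} outside [0, {MAX_INTENSITY}]")
        vector.append(value)
    return vector


if __name__ == "__main__":
    # A Duchenne smile is commonly coded as AU6 (cheek raiser) + AU12 (lip corner puller).
    smile = au_vector({6: 2.0, 12: 3.0})
    for au, value in zip(AU_ORDER, smile):
        if value > 0:
            print(f"AU{au} ({AU_NAMES[au]}): {value}")
```

Because each entry is an independent, continuous intensity, changing a single value (for example lowering AU12 while keeping AU6) adjusts one facial region without touching the others, which is the kind of localized control the abstract describes.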

What problem does this paper attempt to address?

This paper addresses the following problem: current generative models lack local, fine-grained control over facial expressions. Most generated faces show bland neutral expressions or stereotypical smiles that lack realism and diversity, and unconventional expressions (such as suspicion or confusion) are hard to generate reliably. The paper therefore proposes using Action Units (AUs) to achieve fine-grained control over facial expression generation.

### Specific description of the problem

1. **Lack of local control**: Current generative models cannot precisely control specific facial muscle movements, so the generated expressions lack subtlety and realism.
2. **Limited types of expressions**: Most generative models produce only basic emotional expressions (such as happiness, sadness, and anger) and struggle with complex or unconventional expressions (such as suspicion or confusion).
3. **Insufficient intensity control**: Existing methods cannot flexibly adjust the intensity of each action unit, which limits the diversity and realism of expressions.

### Proposed solution

To overcome these problems, the authors propose a method named FineFace. Its core idea is to achieve fine-grained control over generated facial expressions by combining multiple Action Units (AUs). Specifically:

- **Action Units (AUs)**: AUs describe individual facial muscle movements based on facial anatomy, allowing precise and localized control over the intensity of each movement.
- **AU encoder**: An AU encoder converts the input AU conditions into complex facial expressions and supports continuous intensity scales and combinations of multiple action units.
- **Adapter architecture**: An adapter architecture lets FineFace integrate into existing text-to-image (T2I) generation models while accurately following the AU conditions.

### Advantages of the method

1. **Precise control**: Through AUs, users can control specific facial muscles at a fine-grained level to generate diverse and realistic facial expressions.
2. **High interpretability**: AUs provide an intuitive control scheme in which users can see which muscle movements correspond to which expressions.
3. **High flexibility**: The method can generate both common emotional expressions and unconventional ones, such as concentration, suspicion, and confusion.

### Experimental verification

The authors evaluate FineFace through a series of experiments, including quantitative analysis and qualitative evaluation. According to the reported results, FineFace generates complex and unconventional facial expressions while staying consistent with the original prompts.

In conclusion, this paper aims to improve the control that generative models have over facial expressions by introducing AUs, thereby producing more realistic and diverse facial expressions.
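
The AU encoder and adapter ideas summarized above can be illustrated with a small, dependency-free Python sketch. It is not FineFace's implementation: the two-layer MLP, the layer sizes, the random (untrained) weights, and the names `AUEncoder` and `build_conditioning` are all hypothetical. How the AU tokens are actually injected into the T2I model is not specified in this summary, so simple concatenation with the text tokens is used here as a placeholder.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Create a linear layer; random weights stand in for trained parameters."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply y = W x + b to a plain Python list."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class AUEncoder:
    """Hypothetical MLP mapping an AU intensity vector to conditioning tokens."""

    def __init__(self, num_aus=12, hidden=32, num_tokens=4, token_dim=8, seed=0):
        rng = random.Random(seed)
        self.num_aus = num_aus
        self.num_tokens = num_tokens
        self.token_dim = token_dim
        self.fc1 = make_linear(num_aus, hidden, rng)
        self.fc2 = make_linear(hidden, num_tokens * token_dim, rng)

    def __call__(self, au_vec):
        if len(au_vec) != self.num_aus:
            raise ValueError(f"expected {self.num_aus} AU intensities")
        flat = linear(relu(linear(au_vec, self.fc1)), self.fc2)
        # Split the flat output into num_tokens vectors of size token_dim.
        return [flat[i * self.token_dim:(i + 1) * self.token_dim]
                for i in range(self.num_tokens)]


def build_conditioning(text_tokens, au_tokens):
    """Placeholder injection: append AU tokens to the text token sequence."""
    return list(text_tokens) + list(au_tokens)


if __name__ == "__main__":
    encoder = AUEncoder()
    au_vec = [0.0] * 12
    au_vec[6] = 3.0                      # assumed slot for one AU's intensity
    text_tokens = [[0.0] * 8 for _ in range(5)]   # stand-in for text embeddings
    cond = build_conditioning(text_tokens, encoder(au_vec))
    print(len(cond), "conditioning tokens of size", len(cond[0]))
```

The point of the design is that the base T2I model stays unchanged: only the small encoder (and, in a real adapter, the layers that consume its tokens) would need training, and continuous AU intensities pass straight through as real-valued inputs.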

Towards Localized Fine-Grained Control for Facial Expression Generation

Expression Conditional GAN for Facial Expression-to-Expression Translation

cGAN Based Facial Expression Recognition for Human-Robot Interaction

Toward Fine-grained Facial Expression Manipulation

Action Unit Driven Facial Expression Synthesis from a Single Image with Patch Attentive GAN

Global-to-local Expression-aware Embeddings for Facial Action Unit Detection

Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

Video-driven state-aware facial animation

Semantic prior guided fine-grained facial expression manipulation

Expressive Speech-driven Facial Animation with controllable emotions

ExprGAN: Facial Expression Editing With Controllable Expression Intensity

Facial Landmarks and Expression Label Guided Photorealistic Facial Expression Synthesis

Region Based Adversarial Synthesis of Facial Action Units

GANimation: Anatomically-aware Facial Animation from a Single Image

Learning facial expression-aware global-to-local representation for robust action unit detection

Facial Prior Guided Micro-Expression Generation

XAGen: 3D Expressive Human Avatars Generation

Expression-Guided Attention GAN for Fine-Grained Facial Expression Editing

EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

Local and Global Perception Generative Adversarial Network for Facial Expression Synthesis

Controllable Image-to-Video Translation: A Case Study on Facial Expression Generation