EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

Liangwei Jiang, Ruida Li, Zhifeng Zhang, Shuo Fang, Chenguang Ma
2024-12-02
Abstract: This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution to facilitate simultaneous dual control of fine expression and identity. Unlike conventional methods that use coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation for filtering out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer function and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to achieve fine-grained expression control while keeping identity highly consistent in generated portrait images. Existing methods usually generate portraits with only neutral or stereotypical expressions. Even with added control signals such as facial landmarks, these models still struggle to follow user instructions and produce accurate, vivid expressions. The paper proposes EmojiDiff, which aims to overcome these limitations by directly accepting RGB expression images as input templates, providing accurate and fine-grained expression control during the diffusion process.

### Main Contributions

1. **End-to-End Solution**: The paper proposes an end-to-end solution that integrates fine-grained expression control, high-fidelity identity preservation, and strong adaptability to various diffusion models.
2. **Decoupling Scheme**: A decoupling scheme separates the expression features in RGB expression images from irrelevant information such as identity, skin, and style. ID-irrelevant Data Iteration (IDI) prevents identity leakage, and selective feature injection into expression-sensitive layers prevents style leakage.
3. **ID-enhanced Contrast Alignment**: A fine-tuning strategy, ID-enhanced Contrast Alignment (ICA), keeps the identity features of generated portraits stable under different expressions, thereby improving identity fidelity.
4. **Extensive Evaluation**: The method is evaluated extensively on different base models and expressions. The results show that EmojiDiff performs better than existing methods.

### Method Overview

1. **Preliminary**:
   - **Latent Diffusion Model**: The diffusion process and the reverse process both operate in the latent space.
   - **Image Prompt Adapter**: An adapter that combines image prompts with text prompts to control image generation.
2. **Basic E-Adapter Training**:
   - The basic E-Adapter is trained on same-identity data, so the generated image closely matches the reference image in expression, although identity leakage may occur.
3. **ID-irrelevant Data Iteration**:
   - The identity of the generated image is modified by reusing the identity control branch of the basic E-Adapter, further increasing the identity difference between the generated image and the expression reference.
4. **Refined E-Adapter Training**:
   - The refined E-Adapter is trained on the newly generated cross-identity dataset to achieve precise control of expression details while preventing identity leakage.
5. **ID-enhanced Contrast Alignment**:
   - The refined E-Adapter is further fine-tuned with an expression loss and an identity loss to reduce the negative impact of expression control on identity control.

### Experimental Results

- **Quantitative Results**: EmojiDiff significantly outperforms other methods in ID similarity, image quality, and expression control.
- **Qualitative Results**: It transfers subtle facial expressions (such as pouting, single-eye blinking, and pupil movement) more accurately and robustly while maintaining the identity of the source portrait, and it performs well even in artistic styles such as anime and ink-wash painting.

Through these contributions, EmojiDiff provides a new solution for generating portrait images with high identity consistency and rich expressions.
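
The two ideas in the method overview that are easiest to make concrete are the image prompt adapter and the selective injection into expression-sensitive layers. The sketch below is a minimal, pure-Python illustration and not the authors' implementation. It assumes the decoupled cross-attention form commonly used by image prompt adapters, in which a separate attention over image tokens is scaled and added to the text attention. The layer names in `EXPRESSION_SENSITIVE`, the `layer_scale` helper, and the toy vectors are all hypothetical; the summary does not say which layers the paper selects.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def decoupled_cross_attention(queries, text_kv, image_kv, scale):
    """Text attention plus a separately computed, scaled attention over image tokens."""
    text_out = attention(queries, *text_kv)
    if scale == 0.0:
        return text_out  # this layer receives no expression features
    image_out = attention(queries, *image_kv)
    return [[t + scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]


# Hypothetical layer names: which layers count as expression-sensitive
# is an assumption made for this sketch only.
EXPRESSION_SENSITIVE = {"mid_block", "up_block_1"}


def layer_scale(layer_name, scale=1.0):
    """Inject expression features only into the selected layers."""
    return scale if layer_name in EXPRESSION_SENSITIVE else 0.0


# Toy inputs: one query, two text tokens, one expression (image) token.
queries = [[1.0, 0.0]]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [3.0, 3.0]])
image_kv = ([[0.5, 0.5]], [[10.0, 20.0]])

skipped = decoupled_cross_attention(queries, text_kv, image_kv, layer_scale("down_block_0"))
injected = decoupled_cross_attention(queries, text_kv, image_kv, layer_scale("mid_block"))
```

In this toy setup, `skipped` equals the text-only attention output, while `injected` differs from it by exactly the single expression token's value vector. This shows how restricting injection to a subset of layers limits where the reference image can influence the result.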