Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

Renshuai Liu, Bowen Ma, Wei Zhang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Xuan Cheng
2024-04-07
Abstract: In human-centric content generation, pre-trained text-to-image models struggle to produce the portrait images that users want, which retain the identity of individuals while exhibiting diverse expressions. This paper introduces our efforts towards personalized face generation. To this end, we propose a novel multi-modal face generation framework capable of simultaneous identity-expression control and more fine-grained expression synthesis. Our expression control is sophisticated enough to be specialized by a fine-grained emotional vocabulary. We devise a novel diffusion model that can undertake the task of simultaneous face swapping and reenactment. Because identity and expression are entangled, controlling them separately and precisely in one framework is nontrivial and has not been explored yet. To overcome this, we propose several innovative designs in the conditional diffusion model, including a balancing identity and expression encoder, improved midpoint sampling, and explicit background conditioning. Extensive experiments demonstrate the controllability and scalability of the proposed framework in comparison with state-of-the-art text-to-image, face swapping, and face reenactment methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of simultaneously controlling identity and expression in personalized face generation. Existing pre-trained text-to-image models have difficulty retaining an individual's identity while showing diverse expressions when generating portraits that meet user requirements. In addition, expression control in existing methods is still relatively coarse, usually limited to seven or eight common labels (such as "surprised", "happy", "angry"), which cannot cover the full emotional space of the open world.
To overcome these problems, the paper proposes a new multi-modal face generation framework that achieves simultaneous control of identity and expression together with finer-grained expression synthesis. The core of this framework is a novel diffusion model that performs Simultaneous Face Swapping and Reenactment (SFSR). By introducing balanced identity and expression encoders, an improved midpoint sampling method, and an explicit background condition design, the model improves the quality and controllability of the generated images while maintaining high customizability.
In summary, the main contributions of the paper are:
- Proposing a new face generation framework that achieves simultaneous control of identity and expression and finer-grained expression synthesis.
- Defining a new face manipulation task, simultaneous face swapping and reenactment, which has not been explored by previous methods.
- Proposing three innovative designs in the conditional diffusion model, which increase the controllability of the model and the image quality.