StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer

Pin-Yen Chiu, Dai-Jie Wu, Po-Hsun Chu, Chia-Hsuan Hsu, Hsiang-Chen Chiu, Chih-Yu Wang, Jun-Cheng Chen
2024-12-14
Abstract: Kinship face synthesis is a challenging problem due to the scarcity and low quality of the available kinship data. Existing methods often struggle to generate descendants with both high diversity and fidelity while precisely controlling facial attributes such as age and gender. To address these issues, we propose the Style Latent Diffusion Transformer (StyleDiT), a novel framework that integrates the strengths of StyleGAN with the diffusion model to generate high-quality and diverse kinship faces. In this framework, the rich facial priors of StyleGAN enable fine-grained attribute control, while our conditional diffusion model is used to sample a StyleGAN latent aligned with the kinship relationship of conditioning images by utilizing the advantage of modeling complex kinship relationship distribution. StyleGAN then handles latent decoding for final face generation. Additionally, we introduce the Relational Trait Guidance (RTG) mechanism, enabling independent control of influencing conditions, such as each parent's facial image. RTG also enables a fine-grained adjustment between the diversity and fidelity in synthesized faces. Furthermore, we extend the application to an unexplored domain: predicting a partner's facial images using a child's image and one parent's image within the same framework. Extensive experiments demonstrate that our StyleDiT outperforms existing methods by striking an excellent balance between generating diverse and high-fidelity kinship faces.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses several key challenges in kinship face synthesis:

1. **Data scarcity and low quality**: Existing kinship face datasets are typically small, low-resolution, and of poor quality, which makes it difficult to generate high-fidelity images of offspring.
2. **Balance between diversity and fidelity**: Existing methods struggle to generate kinship faces that are both diverse and faithful, especially while precisely controlling facial attributes such as age and gender. Specifically:
   - Some methods produce too little diversity, so the generated child faces look too similar to one another.
   - Other methods generate diverse faces but fail to preserve a strong resemblance to the parents' facial features.
3. **Independent control of influencing conditions**: Generating kinship faces requires controlling each influencing condition (for example, each parent's facial image) independently in order to reach the best balance between diversity and fidelity.
4. **New task of partner face prediction**: Previous research has mainly focused on generating a child's face from the parents' faces, while predicting one parent's face from a child's face and the other parent's face has remained largely unexplored.

To solve these problems, the paper proposes the StyleDiT framework, which combines the fine-grained facial attribute control of StyleGAN with the generative power of a diffusion model to achieve the following goals:

- **Generate high-quality and diverse kinship faces**: A conditional diffusion model samples a latent in StyleGAN's style latent space, and StyleGAN decodes that latent into the final face, yielding images with both high fidelity and diversity.
- **Precisely control attributes and conditions**: StyleGAN's facial priors provide fine-grained control over attributes such as age and gender, while the Relational Trait Guidance (RTG) mechanism lets users control each influencing condition independently and thereby adjust the trade-off between diversity and fidelity.
- **Extend to partner face prediction**: Within the same framework, the paper predicts one parent's face from a child's face and the other parent's face, which opens up applications such as criminal investigations and searching for missing persons.

### Formula summary

- **Generation process formula**: \[ I_{\text{out}} = F(I_{\text{in1}}, I_{\text{in2}}, \alpha, \beta) \] where \(I_{\text{in1}}\) and \(I_{\text{in2}}\) are the two input facial images (the father's and the mother's in child synthesis), \(\alpha\) is the target age, \(\beta\) is the target gender, and \(I_{\text{out}}\) is the generated facial image. The age \(\alpha\) is unrelated to the noise-schedule coefficient \(\bar{\alpha}_t\) used below.
- **Forward noise process of the diffusion model**: \[ q(x_t|x_0)=\mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0,(1 - \bar{\alpha}_t)I) \] which can be sampled directly as \[ x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1 - \bar{\alpha}_t}\epsilon_t, \quad \epsilon_t\sim\mathcal{N}(0, I) \]
- **Reverse denoising process**: \[ p_\theta(x_{t - 1}|x_t)=\mathcal{N}(x_{t - 1}; \mu_\theta(x_t),\Sigma_\theta(x_t)) \]
- **Loss function** (variational bound on the negative log-likelihood): \[ L(\theta)=-\log p_\theta(x_0|x_1)+\sum_{t>1}D_{\text{KL}}(q^*(x_{t - 1}|x_t,x_0)||p_\theta(x_{t - 1}|x_t)) \] where \(q^*(x_{t - 1}|x_t,x_0)\) is the posterior of the forward process.
- **Diversity evaluation formula**: \[ DS=\frac{1}{N(N - 1)}\sum_{i\neq j}\frac{x_i\cdot x_j}{\|x_i\|\|x_j\|} \] As written, this is the mean pairwise cosine similarity over \(N\) vectors \(x_i\).
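The forward noise process in the formula summary can be illustrated with a few lines of standard-library Python. This is only a minimal sketch of the closed-form sampling step \(x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1 - \bar{\alpha}_t}\epsilon_t\), not the paper's implementation: the function name `forward_noise`, the toy four-dimensional vector standing in for a StyleGAN latent, and the chosen \(\bar{\alpha}_t\) values are all assumptions made for illustration.

```python
import math
import random


def forward_noise(x0, alpha_bar_t, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(a) * x0, (1 - a) * I), with a = alpha_bar_t.

    Hypothetical helper: returns both the noised vector and the noise used,
    since a denoiser is typically trained against that noise.
    """
    if not 0.0 <= alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in [0, 1]")
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    # One independent standard-normal draw per latent dimension.
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal_scale * v + noise_scale * e for v, e in zip(x0, eps)]
    return x_t, eps


rng = random.Random(0)
w = [0.5, -1.0, 2.0, 0.0]  # toy stand-in for a StyleGAN latent (assumption)

x_clean, _ = forward_noise(w, 1.0, rng)        # alpha_bar = 1: latent is untouched
x_mid, eps_mid = forward_noise(w, 0.5, rng)    # alpha_bar = 0.5: equal signal and noise variance
x_noise, eps_end = forward_noise(w, 0.0, rng)  # alpha_bar = 0: pure Gaussian noise
```

At \(\bar{\alpha}_t = 1\) the latent passes through unchanged, and at \(\bar{\alpha}_t = 0\) the output equals the sampled noise, which matches the two ends of the forward process.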
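The diversity evaluation formula can likewise be computed directly from a list of vectors. The sketch below follows the formula exactly as stated in the summary; the names `cosine` and `diversity_score` are hypothetical, and the assumption that each \(x_i\) is a plain feature vector (a list of floats) is ours, since the summary does not say which features are used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversity_score(vectors):
    """DS = 1 / (N (N - 1)) * sum over ordered pairs i != j of cos(x_i, x_j)."""
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two vectors")
    total = sum(
        cosine(vectors[i], vectors[j])
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return total / (n * (n - 1))


# Toy feature vectors (assumption): two identical samples and one orthogonal to both.
score = diversity_score([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

Read literally, the formula is a mean cosine similarity, so lower values correspond to more varied vectors; how the paper orients the metric is not stated in this summary.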