Analyzing the Feature Extractor Networks for Face Image Synthesis

Erdi Sarıtaş, Hazım Kemal Ekenel
2024-06-04
Abstract: Advancements like Generative Adversarial Networks have attracted the attention of researchers toward face image synthesis to generate ever more realistic images. Thereby, the need for the evaluation criteria to assess the realism of the generated images has become apparent. While FID utilized with InceptionV3 is one of the primary choices for benchmarking, concerns about InceptionV3's limitations for face images have emerged. This study investigates the behavior of diverse feature extractors -- InceptionV3, CLIP, DINOv2, and ArcFace -- considering a variety of metrics -- FID, KID, Precision & Recall. While the FFHQ dataset is used as the target domain, as the source domains, the CelebA-HQ dataset and the synthetic datasets generated using StyleGAN2 and Projected FastGAN are used. Experiments include deep-down analysis of the features: $L_2$ normalization, model attention during extraction, and domain distributions in the feature space. We aim to give valuable insights into the behavior of feature extractors for evaluating face image synthesis methodologies. The code is publicly available at https://github.com/ThEnded32/AnalyzingFeatureExtractors.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper examines how different feature extraction networks behave when they are used to evaluate face image synthesis. It addresses three issues:

1. **Effectiveness of Evaluation Metrics**: The dominant way to assess the realism of generated face images is FID (Fréchet Inception Distance) computed on InceptionV3 features. Concerns about InceptionV3's limitations on face images call the reliability of this setup into question. The paper therefore evaluates several feature extractors (InceptionV3, CLIP, DINOv2, and ArcFace) under multiple metrics (FID, KID, and Precision & Recall).
2. **Choice of Feature Extraction Networks**: InceptionV3 is among the most commonly used feature extractors, yet its suitability for face images is in doubt. The paper compares it with CLIP, DINOv2, and ArcFace to understand how their behavior differs in face synthesis evaluation.
3. **Performance on Different Datasets**: FFHQ serves as the target domain. The source domains are the real CelebA-HQ dataset and two synthetic datasets generated with StyleGAN2 and Projected FastGAN, so the paper can examine how each extractor and metric responds to real versus synthetic sources.

In summary, the paper analyzes feature extraction networks and evaluation metrics for assessing the realism and quality of synthesized face images, with the aim of providing insights for subsequent research.
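
To make the metrics named above concrete, here is a minimal sketch of how FID and KID can be computed once features have been extracted, with optional $L_2$ normalization of those features. It is not the authors' code (their implementation lives in the repository linked in the abstract). The function names, the random stand-in features, and the single full-set KID estimate (instead of an average over subsets) are all assumptions made for illustration. Precision & Recall is not covered.

```python
import numpy as np


def l2_normalize(feats, eps=1e-12):
    """Scale each feature vector (row) to unit Euclidean length."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, eps)


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    Assumes 2-D arrays of shape (num_images, feature_dim), feature_dim >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b. These are real and non-negative in exact
    # arithmetic, so small negative round-off values are clipped to zero.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def kernel_distance(feats_a, feats_b):
    """Unbiased MMD^2 estimate with the cubic polynomial kernel used by KID.

    Simplification: one estimate over the full sets, no subset averaging.
    """
    d = feats_a.shape[1]
    m, n = len(feats_a), len(feats_b)
    k_aa = (feats_a @ feats_a.T / d + 1.0) ** 3
    k_bb = (feats_b @ feats_b.T / d + 1.0) ** 3
    k_ab = (feats_a @ feats_b.T / d + 1.0) ** 3
    # Within-set terms exclude the diagonal (each sample paired with itself).
    term_a = (k_aa.sum() - np.trace(k_aa)) / (m * (m - 1))
    term_b = (k_bb.sum() - np.trace(k_bb)) / (n * (n - 1))
    return float(term_a + term_b - 2.0 * k_ab.mean())


# Random stand-ins for extracted features (hypothetical, not real data).
rng = np.random.default_rng(0)
target = rng.normal(size=(500, 8))            # plays the role of the target domain
target2 = rng.normal(size=(500, 8))           # second sample, same distribution
source = rng.normal(loc=0.5, size=(500, 8))   # a shifted source domain

fid_same = frechet_distance(target, target)
fid_shift = frechet_distance(target, source)
kid_null = kernel_distance(target, target2)
kid_shift = kernel_distance(target, source)

# Same metric on L2-normalized features.
fid_shift_normed = frechet_distance(l2_normalize(target), l2_normalize(source))
```

Swapping the feature extractor only changes how `target` and `source` are produced; the metric code stays the same, which is what makes a side-by-side comparison of extractors possible.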