Abstract: Face recognition applications have grown in parallel with the size of datasets, the complexity of deep learning models, and computational power. However, while deep learning models evolve to become more capable and computational power keeps increasing, the available datasets are being retracted and removed from public access. Privacy and ethical concerns are relevant topics within these domains. Through generative artificial intelligence, researchers have put effort into developing completely synthetic datasets that can be used to train face recognition systems. Nonetheless, recent advances have not been sufficient to achieve performance comparable to state-of-the-art models trained on real data. To study the drift between the performance of models trained on real and synthetic datasets, we leverage a massive attribute classifier (MAC) to create annotations for four datasets: two real and two synthetic. From these annotations, we conduct studies on the distribution of each attribute within all four datasets. Additionally, we further inspect the differences between real and synthetic datasets on the attribute set. When comparing through the Kullback-Leibler divergence, we found differences between real and synthetic samples. Interestingly, we verified that while real samples suffice to explain the synthetic distribution, the opposite could not be further from the truth.

What problem does this paper attempt to address?

This paper examines the performance gap between face recognition models trained on real datasets and those trained on synthetic datasets. Deep learning models and computing power have made face recognition systems more capable, but privacy and ethical concerns have led to the withdrawal of some large real datasets. In response, researchers have turned to generative artificial intelligence to create synthetic datasets, yet models trained on them have not reached the performance of models trained on real data.

The researchers annotated two real datasets and two synthetic datasets using a massive attribute classifier (MAC), and then analyzed the distribution of each attribute across the four datasets. Comparing the datasets through the Kullback-Leibler divergence reveals differences between real and synthetic samples. They found that while real data can explain the distribution of synthetic data, the reverse is not true.

The paper also discusses previous work on automatic annotation strategies, synthetic data generation, and the use of soft biometric features (such as gender, age, and race) for diversity analysis. The experimental design covers the annotation process, the comparison methods, and the experimental settings for the different datasets. The results show that synthetic data performs poorly on certain attributes (such as gender and smile detection) and differs significantly from real data in overall diversity. In conclusion, the paper aims to understand how well synthetic data reproduces the distribution of real data and to identify the gaps between the two, providing a basis for improving synthetic data generation methods and, in turn, the performance of face recognition systems.
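The Kullback-Leibler divergence is not symmetric, which is why comparing real against synthetic data can give a different answer than comparing synthetic against real. The following is a minimal sketch of that effect, not the paper's implementation: it assumes a single binary attribute (a hypothetical "smiling" label) and uses made-up annotation lists in place of actual MAC outputs. The helper names `attribute_rate` and `bernoulli_kl` are illustrative only.

```python
import math


def attribute_rate(labels):
    """Fraction of samples in which a binary attribute is present."""
    return sum(labels) / len(labels)


def bernoulli_kl(p, q, eps=1e-9):
    """KL(P || Q) in nats between Bernoulli(p) and Bernoulli(q).

    Rates are clipped to [eps, 1 - eps] so the logarithms stay finite.
    """
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


# Made-up annotations for illustration only (1 = attribute present).
real_smiling = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]        # attribute rate 0.5
synthetic_smiling = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]   # attribute rate 0.9

p_real = attribute_rate(real_smiling)
p_syn = attribute_rate(synthetic_smiling)

# The two directions of the divergence generally differ.
kl_syn_real = bernoulli_kl(p_syn, p_real)   # KL(synthetic || real)
kl_real_syn = bernoulli_kl(p_real, p_syn)   # KL(real || synthetic)

print(f"KL(synthetic || real) = {kl_syn_real:.4f}")
print(f"KL(real || synthetic) = {kl_real_syn:.4f}")
```

With these toy rates the two directions disagree, so a full comparison would report both. Extending the sketch to many attributes could be done by computing one divergence per attribute, under the assumption that each attribute is treated independently.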

Massively Annotated Datasets for Assessment of Synthetic and Real Data in Face Recognition

The Impact of Balancing Real and Synthetic Data on Accuracy and Fairness in Face Recognition

SynFace: Face Recognition with Synthetic Data

If It's Not Enough, Make It So: Reducing Authentic Data Demand in Face Recognition through Synthetic Faces

Face Recognition Using Synthetic Face Data

SDFR: Synthetic Data for Face Recognition Competition

Synthetic Data for Face Recognition: Current State and Future Prospects

Synthetic Data for the Mitigation of Demographic Biases in Face Recognition

GANDiffFace: Controllable Generation of Synthetic Datasets for Face Recognition with Realistic Variations

On the use of automatically generated synthetic image datasets for benchmarking face recognition

Digi2Real: Bridging the Realism Gap in Synthetic Data Face Recognition via Foundation Models

Toward Fairer Face Recognition Datasets

Bias and Diversity in Synthetic-based Face Recognition

VariFace: Fair and Diverse Synthetic Dataset Generation for Face Recognition

SDFD: Building a Versatile Synthetic Face Image Dataset with Diverse Attributes

AI-Face: A Million-Scale Demographically Annotated AI-Generated Face Dataset and Fairness Benchmark

Efficient Realistic Data Generation Framework leveraging Deep Learning-based Human Digitization

TCDiff: Triple Condition Diffusion Model with 3D Constraints for Stylizing Synthetic Faces

Analysis of Classifier Training on Synthetic Data for Cross-Domain Datasets

SIG: A Synthetic Identity Generation Pipeline for Generating Evaluation Datasets for Face Recognition