Massively Annotated Datasets for Assessment of Synthetic and Real Data in Face Recognition

Pedro C. Neto, Rafael M. Mamede, Carolina Albuquerque, Tiago Gonçalves, Ana F. Sequeira
2024-04-24
Abstract: Face recognition applications have grown in parallel with the size of datasets, complexity of deep learning models and computational power. However, while deep learning models evolve to become more capable and computational power keeps increasing, the datasets available are being retracted and removed from public access. Privacy and ethical concerns are relevant topics within these domains. Through generative artificial intelligence, researchers have put efforts into the development of completely synthetic datasets that can be used to train face recognition systems. Nonetheless, the recent advances have not been sufficient to achieve performance comparable to the state-of-the-art models trained on real data. To study the drift between the performance of models trained on real and synthetic datasets, we leverage a massive attribute classifier (MAC) to create annotations for four datasets: two real and two synthetic. From these annotations, we conduct studies on the distribution of each attribute within all four datasets. Additionally, we further inspect the differences between real and synthetic datasets on the attribute set. When comparing through the Kullback-Leibler divergence we have found differences between real and synthetic samples. Interestingly enough, we have verified that while real samples suffice to explain the synthetic distribution, the opposite could not be further from being true.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper examines the performance gap between face recognition models trained on real datasets and those trained on synthetic datasets. As deep learning models and computing power have advanced, face recognition systems have become more capable, but privacy and ethical concerns have led to the retraction of several large real datasets. In response, researchers have turned to generative artificial intelligence to create fully synthetic datasets, yet models trained on them have not matched the performance of models trained on real data. The authors annotate four datasets, two real and two synthetic, using a massive attribute classifier (MAC), and then analyze how each attribute is distributed within each dataset. Comparing the datasets with the Kullback-Leibler divergence reveals differences between real and synthetic samples. In particular, real samples suffice to explain the synthetic distribution, but the reverse does not hold: synthetic samples do not explain the real distribution. The paper also reviews previous work on automatic annotation strategies, synthetic data generation, and the use of soft biometric features (such as gender, age, and race) for diversity analysis. The experimental design covers the annotation process, the comparison methods, and the experimental settings for the different datasets. The results indicate that synthetic data performs poorly on certain attributes (such as gender and smile detection) and differs significantly from real data in overall diversity. In conclusion, the paper aims to understand how well synthetic data approximates the distribution of real data and to identify the gaps between the two, providing a basis for improving synthetic data generation methods and, in turn, the performance of face recognition systems trained on them.
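The asymmetry described above follows from the Kullback-Leibler divergence itself being asymmetric: KL(P || Q) and KL(Q || P) generally differ. The sketch below is only an illustration of that property, not the authors' procedure. The function name `kl_divergence`, the smoothing constant `eps`, and the attribute frequencies are all hypothetical and are not taken from the paper.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """Return KL(P || Q) in nats for two discrete distributions on the same bins.

    `eps` is an assumed smoothing constant that avoids log(0) and division by zero.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    # Renormalize after smoothing so both vectors sum to 1.
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


# Hypothetical frequencies of one binary attribute (absent, present).
# These numbers are made up for illustration only.
real = [0.5, 0.5]        # attribute is balanced in the real dataset
synthetic = [0.9, 0.1]   # attribute is rare in the synthetic dataset

kl_real_to_synth = kl_divergence(real, synthetic)
kl_synth_to_real = kl_divergence(synthetic, real)

print(f"KL(real || synthetic) = {kl_real_to_synth:.4f}")
print(f"KL(synthetic || real) = {kl_synth_to_real:.4f}")
```

With these made-up frequencies, KL(real || synthetic) is the larger of the two values, because the synthetic distribution assigns little mass to an outcome that is common in the real one. Applying the same function per attribute and in both directions is one simple way to compare the attribute distributions of two annotated datasets.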