Rethinking FID: Towards a Better Evaluation Metric for Image Generation

Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, Sanjiv Kumar

2024-01-26

Abstract: As with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images, and those of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, it does not reflect gradual improvement of iterative text-to-image models, it does not capture distortion levels, and that it produces inconsistent results when varying the sample size. We also propose an alternative new metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
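For reference, the two distances named in the abstract can be written out as below. This is a sketch using the standard textbook definitions, not notation taken from the paper: the symbols $\mu_r, \Sigma_r$ (mean and covariance of real-image features), $\mu_g, \Sigma_g$ (the same for generated images), the distributions $P, Q$, and the bandwidth $\sigma$ are assumptions introduced here for illustration.

```latex
% Fréchet distance between two Gaussians fitted to the feature sets (FID)
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

% Squared maximum mean discrepancy with a Gaussian RBF kernel
\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}\,[k(x, x')]
  + \mathbb{E}_{y, y' \sim Q}\,[k(y, y')]
  - 2\,\mathbb{E}_{x \sim P,\; y \sim Q}\,[k(x, y)],
\qquad
k(x, y) = \exp\!\left( -\frac{\lVert x - y \rVert^2}{2\sigma^2} \right)
```

The first expression depends on the features only through their first two moments, which is where the normality assumption enters; the second compares the samples directly through the kernel and needs no parametric model.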

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses shortcomings of the Fréchet Inception Distance (FID) as an evaluation metric for image generation models. Specifically:

1. **Inconsistency with human evaluators**: FID rankings can contradict human judgments, especially when evaluating text-to-image models.
2. **Incorrect statistical assumptions**: FID assumes that Inception features follow a multivariate normal distribution. Real feature distributions do not satisfy this assumption, so the resulting scores can be misleading.
3. **Sample inefficiency**: FID needs a large number of samples to estimate the covariance matrix reliably, which limits its use on small sample sets and makes its value depend on the sample size.
4. **Inability to capture gradual improvements**: For iterative text-to-image models, FID fails to reflect image quality improving from one iteration to the next.

To address these issues, the authors propose a new evaluation metric, CMMD (CLIP-Maximum Mean Discrepancy), which combines CLIP embeddings with the maximum mean discrepancy (MMD) distance under a Gaussian RBF kernel. In the authors' experiments, CMMD agrees with human raters, tracks iterative refinement, and responds to distortion levels in cases where FID does not, which the authors take as evidence that it is the more reliable and robust measure of image quality.
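The MMD part of the metric can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `rbf_kernel` and `mmd2_unbiased` are hypothetical, the default bandwidth `sigma=10.0` is a placeholder, and the inputs are assumed to be arrays of precomputed image embeddings (for CMMD these would come from a CLIP image encoder, which is not shown here).

```python
import numpy as np


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point round-off
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def mmd2_unbiased(x, y, sigma=10.0):
    """Unbiased estimate of squared MMD between samples x (m, d) and y (n, d).

    sigma is an assumed bandwidth; choose it for your own embedding space.
    """
    m, n = len(x), len(y)
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    # Drop the diagonal (i == j) terms so the within-set averages are unbiased
    term_x = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_y = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * k_xy.mean()
```

A quick sanity check is to call `mmd2_unbiased` on two samples drawn from the same distribution (the estimate should be close to zero and may be slightly negative) and then on a sample paired with a shifted copy (the estimate should be clearly larger). No covariance matrix is estimated at any point, which is why this kind of estimator avoids the normality assumption described in item 2 above.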

Reviewing FID and SID Metrics on Generative Adversarial Networks

Using Skew to Assess the Quality of GAN-generated Image Features

The Role of ImageNet Classes in Fréchet Inception Distance

Evaluation Metrics for Conditional Image Generation

A Study on the Evaluation of Generative Models

Compound Frechet Inception Distance for Quality Assessment of GAN Created Images

Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality

Feature Extraction for Generative Medical Imaging Evaluation: New Evidence Against an Evolving Trend

A Distributional Evaluation of Generative Image Models

F?D: On understanding the role of deep feature spaces on face generation evaluation

Normalizing Flow-Based Metric for Image Generation

Fréchet Denoised Distance: Enhancing Plausibility Evaluation for Generated Designs with Denoising Autoencoder

Evaluating Text-to-Image GANs Performance: A Comparative Analysis of Evaluation Metrics

FLD+: Data-efficient Evaluation Metric for Generative Models

Improving Sample-based Evaluation for Generative Adversarial Networks

A study of the evaluation metrics for generative images containing combinational creativity

An Optimism-based Approach to Online Evaluation of Generative Models

On Aliased Resizing and Surprising Subtleties in GAN Evaluation

An Improved Evaluation Framework for Generative Adversarial Networks