Abstract: We introduce a new metric for assessing the quality of generated images that is more reliable, data-efficient, compute-efficient, and adaptable to new domains than previous metrics such as Fréchet Inception Distance (FID). The proposed metric is based on normalizing flows, which allow the density (exact log-likelihood) of images from any domain to be computed. Thus, unlike FID, the proposed Flow-based Likelihood Distance Plus (FLD+) metric exhibits strongly monotonic behavior with respect to different types of image degradation, including noise, occlusion, diffusion steps, and generative model size. Additionally, because normalizing flows can be trained stably and efficiently, FLD+ achieves stable results with two orders of magnitude fewer images than FID, which requires more images to reliably compute the Fréchet distance between features of large samples of real and generated images. We made FLD+ even more computationally efficient by applying normalizing flows to features extracted in a lower-dimensional latent space instead of using a pre-trained network. We also show that FLD+ can easily be retrained on new domains, such as medical images, unlike the networks behind previous metrics, such as InceptionNetV3 pre-trained on ImageNet.
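
The "exact log-likelihood" property comes from the change-of-variables formula, log p(x) = log N(f(x); 0, I) + log|det ∂f(x)/∂x|. Below is a minimal Python/NumPy sketch of it, using a single element-wise affine layer as the flow. Everything here is an illustrative assumption rather than the paper's implementation. `DiagonalAffineFlow` and `likelihood_gap` are hypothetical names. The toy vectors stand in for the low-dimensional features FLD+ uses. The score, an absolute gap in mean log-likelihood between real and degraded samples, is not the published FLD+ formula. The sketch only shows how a flow fit on real data can rank a degradation (here additive Gaussian noise) monotonically.

```python
import numpy as np


class DiagonalAffineFlow:
    """Hypothetical toy flow z = (x - shift) / scale with a standard normal base."""

    def fit(self, x: np.ndarray) -> "DiagonalAffineFlow":
        # For a single element-wise affine layer, maximum likelihood is closed form.
        self.shift = x.mean(axis=0)
        self.scale = x.std(axis=0) + 1e-6
        return self

    def log_prob(self, x: np.ndarray) -> np.ndarray:
        z = (x - self.shift) / self.scale
        d = x.shape[1]
        # log N(z; 0, I) for the base distribution
        log_base = -0.5 * np.sum(z**2, axis=1) - 0.5 * d * np.log(2.0 * np.pi)
        # log|det dz/dx| = -sum(log scale), the same for every sample
        log_det = -np.sum(np.log(self.scale))
        return log_base + log_det


def likelihood_gap(flow, real, generated):
    # Hypothetical score (NOT the published FLD+ formula): absolute difference
    # in mean log-likelihood, in nats per sample, under a flow fit on real data.
    return float(abs(flow.log_prob(real).mean() - flow.log_prob(generated).mean()))


rng = np.random.default_rng(0)
d = 16
mean = rng.normal(size=d)
std = rng.uniform(0.5, 2.0, size=d)
train = mean + std * rng.normal(size=(5000, d))  # stand-in for real-image features
real = mean + std * rng.normal(size=(500, d))    # held-out real features
flow = DiagonalAffineFlow().fit(train)

gaps = []
for sigma in [0.0, 0.25, 0.5, 1.0]:
    degraded = real + sigma * rng.normal(size=real.shape)  # additive-noise degradation
    gaps.append(likelihood_gap(flow, real, degraded))
    print(f"noise sigma={sigma}: gap={gaps[-1]:.3f}")
```

A practical metric would replace the single affine layer with a deep stack of trained invertible layers, such as coupling layers. One affine map can only model an axis-aligned Gaussian.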

### What problems does this paper attempt to solve?

This paper addresses several key weaknesses of existing evaluation metrics for generative models, such as Fréchet Inception Distance (FID), when they are used to assess the quality of generated images:

1. **Lack of reliability**: Existing metrics, especially FID, behave non-monotonically under different types of image degradation (such as noise, occlusion, and diffusion steps). As the degradation becomes more severe, the score does not always worsen, which makes the evaluation unreliable.
2. **Low data efficiency**: Metrics such as FID require a large number of samples (usually more than 20,000 images) to produce stable, reliable results. This is a serious obstacle in domains where data is scarce or computing resources are limited.
3. **Low computational efficiency**: FID relies on the pre-trained InceptionV3 network, which adds computational cost and limits the metric's flexibility in new domains. FID is especially time-consuming on high-resolution images.
4. **Poor adaptability to specific domains**: FID and other metrics built on ImageNet-pre-trained networks perform poorly on generated images from specialized domains (such as medical images), because these networks are optimized for ImageNet and do not capture the features of other domains well.

To address these problems, the paper proposes a new evaluation metric, Flow-based Likelihood Distance Plus (FLD+), which improves on existing metrics in the following ways:

- **Based on normalizing flows**: FLD+ uses normalizing flows to compute the likelihood of generated images under the distribution of real images, yielding a more reliable evaluation.
- **Data- and compute-efficient**: FLD+ needs fewer samples and less computation because it applies normalizing flows in a low-dimensional feature space.
- **Highly adaptable**: FLD+ can easily be retrained for new domains (such as medical images) without large amounts of data or computing resources.

Overall, FLD+ aims to provide a more reliable, data-efficient, compute-efficient, and adaptable evaluation metric that better meets the needs of modern generative models.
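
The data-efficiency problem with FID can be seen in the Fréchet distance it computes between Gaussian fits of two feature sets, d² = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). Below is a minimal NumPy sketch of that formula. It uses synthetic 64-dimensional Gaussian vectors instead of InceptionV3 activations, an assumption made to keep it self-contained. `frechet_distance` and `_sqrtm_psd` are illustrative names, not a library API. Both samples come from the same distribution, so the true distance is zero. Even so, the estimate is positive and shrinks only as the sample size grows.

```python
import numpy as np


def _sqrtm_psd(m: np.ndarray) -> np.ndarray:
    # Square root of a symmetric positive semi-definite matrix via eigh;
    # symmetrize first to remove floating-point asymmetry.
    m = 0.5 * (m + m.T)
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two (n, d) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) = Tr((A^{1/2} cov_b A^{1/2})^{1/2}) with A = cov_a
    sqrt_a = _sqrtm_psd(cov_a)
    cross = _sqrtm_psd(sqrt_a @ cov_b @ sqrt_a)
    return float(
        np.sum((mu_a - mu_b) ** 2)
        + np.trace(cov_a)
        + np.trace(cov_b)
        - 2.0 * np.trace(cross)
    )


rng = np.random.default_rng(0)
d = 64
fd_by_n = []
for n in [100, 1000, 10000]:
    # Two independent samples from the SAME distribution: the true distance is 0.
    a = rng.normal(size=(n, d))
    b = rng.normal(size=(n, d))
    fd_by_n.append(frechet_distance(a, b))
    print(f"n={n}: estimated distance={fd_by_n[-1]:.3f}")
```

The matrix square root is taken on the symmetric matrix Σ_r^{1/2} Σ_g Σ_r^{1/2}, which has the same eigenvalues as Σ_r Σ_g. A symmetric eigendecomposition therefore suffices, and no complex values arise. The printed values are illustrative only and are not results from the paper.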

FLD+: Data-efficient Evaluation Metric for Generative Models
