X-IQE: eXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models

Yixiong Chen, Li Liu, Chris Ding
2023-05-26
Abstract: This paper introduces a novel explainable image quality evaluation approach called X-IQE, which leverages visual large language models (LLMs) to evaluate text-to-image generation methods by generating textual explanations. X-IQE utilizes a hierarchical Chain of Thought (CoT) to enable MiniGPT-4 to produce self-consistent, unbiased texts that are highly correlated with human evaluation. It offers several advantages, including the ability to distinguish between real and generated images, evaluate text-image alignment, and assess image aesthetics without requiring model training or fine-tuning. X-IQE is more cost-effective and efficient than human evaluation, while significantly enhancing the transparency and explainability of deep image quality evaluation models. We validate the effectiveness of our method as a benchmark using images generated by prevalent diffusion models. X-IQE demonstrates similar performance to state-of-the-art (SOTA) evaluation methods on COCO Caption, while overcoming the limitations of previous evaluation models on DrawBench, particularly in handling ambiguous generation prompts and text recognition in generated images. Project website: https://github.com/Schuture/Benchmarking-Awesome-Diffusion-Models
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to address two key issues in image quality assessment:

1. **Limitations of existing methods**: Manual assessment is costly and poorly reproducible; model-based assessment requires complex models and specially annotated data, and lacks the generalization ability of humans.
2. **Interpretability and transparency**: Existing model-based methods typically predict only an image quality score, so biases and defects inherited from the training data are hard to explain and can lead to poor model performance.

The paper proposes a new interpretable image quality assessment method, X-IQE, which uses pre-trained visual large language models (such as MiniGPT-4) to generate image analysis text, thereby achieving a comprehensive assessment of image quality. X-IQE has the following advantages:

- **Interpretability**: Generates descriptions of the reasoning process through Chain of Thought (CoT).
- **Comprehensiveness**: Designed prompts can conduct multi-faceted assessments that are not limited to specific features.
- **Strong performance**: Draws on the generalization ability of large language models.
- **Unbiasedness**: Avoids biases introduced by dataset annotations by using objective prompt texts.
- **No training required**: Leverages the capabilities of pre-trained models without additional data collection or training.

Extensive experiments have validated the effectiveness of X-IQE on real and AI-generated images, demonstrating its potential as a benchmark for evaluating text-to-image generation models.
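
The hierarchical Chain of Thought mentioned in the abstract can be illustrated with a minimal sketch in which each evaluation stage sees the explanations produced by the stages before it. This is not the authors' implementation: the stage order, the prompt wording, the function name `evaluate_image`, and the `vlm` callable (a stand-in for a model such as MiniGPT-4) are all assumptions made for illustration. Only the three aspects themselves (real vs. generated, text-image alignment, aesthetics) come from the abstract.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt templates, one per aspect named in the abstract.
# The actual X-IQE prompts are not reproduced here.
STAGES: List[Tuple[str, str]] = [
    ("fidelity",
     "Does this image look like a real photograph or an AI-generated one? "
     "Explain your reasoning, then give a score from 1 to 10."),
    ("alignment",
     "Caption: {caption}\n"
     "Does the image match the caption? "
     "Explain your reasoning, then give a score from 1 to 10."),
    ("aesthetics",
     "How aesthetically pleasing is the image? "
     "Explain your reasoning, then give a score from 1 to 10."),
]


def evaluate_image(
    image: object,
    caption: str,
    vlm: Callable[[object, str], str],
) -> Dict[str, str]:
    """Run the stages in order, feeding earlier explanations forward.

    `vlm` is any callable taking (image, prompt) and returning text;
    it is a placeholder for a visual large language model.
    """
    context: List[str] = []      # explanations from earlier stages
    results: Dict[str, str] = {}
    for name, template in STAGES:
        question = template.format(caption=caption)
        # Prepend earlier explanations so later stages can build on them.
        prompt = "\n".join(context + [question])
        explanation = vlm(image, prompt)
        results[name] = explanation
        context.append(f"[{name}] {explanation}")
    return results
```

Because the model is passed in as a plain callable, the same loop can be exercised with a stub during testing and with a real visual LLM wrapper in practice. Parsing a numeric score out of each explanation is left out, since the source does not specify the output format.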