T2I-Scorer: Quantitative Evaluation on Text-to-Image Generation Via Fine-Tuned Large Multi-Modal Models
Haoning Wu, Xiele Wu, Chunyi Li, Zicheng Zhang, Chaofeng Chen, Xiaohong Liu, Guangtao Zhai, Weisi Lin
DOI: https://doi.org/10.1145/3664647.3680939
2024-01-01
Abstract: Text-to-image (T2I) generation is a core interest within AI content generation. Amid the swift advancement of both open-source (such as Stable Diffusion) and proprietary (for example, DALLE, MidJourney) T2I models, a comprehensive and robust quantitative framework for evaluating their output quality is notably absent. Traditional quality assessment methods overlook the textual prompts when judging images; meanwhile, large multi-modal models (LMMs) can incorporate text prompts in evaluations, yet how to fine-tune these models for precise T2I quality assessment remains unresolved. In this study, we introduce T2I-Scorer, a novel two-stage training methodology for fine-tuning LMMs for T2I evaluation. In the first stage, we collect 397K GPT-4V-labeled question-answer pairs related to T2I evaluation. This pseudo-labeled dataset, termed T2I-ITD, is analyzed and examined by humans and used for instruction tuning to improve the LMM's low-level quality perception. The first-stage model, T2I-Scorer-IT, achieves higher accuracy on T2I evaluation than all existing kinds of T2I metrics under zero-shot settings. In the second stage, we define an explicit multi-task training scheme to further align the LMM with human opinion scores, and the fine-tuned T2I-Scorer reaches state-of-the-art accuracy on both image quality and image-text alignment, with significant improvements. We anticipate that the proposed metric can serve as a reliable gauge of the ability of T2I generation models in the future. We will make code, data, and weights publicly available.