Improving Text-to-Image Consistency via Automatic Prompt Optimization

Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, Michal Drozdzal

2024-03-26

Abstract: Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high-performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

This paper addresses the difficulty that text-to-image (T2I) generation models have in producing images that are consistent with the input prompt. Existing methods often require fine-tuning the model, focus only on nearby prompt samples, and face unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. The study proposes a framework called OPT2I that uses a large language model (LLM) to improve the prompt-image consistency of T2I models by optimizing the prompts. The framework starts from a user prompt and iteratively generates revised prompts to maximize a consistency score. Experiments on MSCOCO and PartiPrompts show that OPT2I can raise the initial consistency score by up to 24.9% in terms of DSG score while preserving FID and increasing the recall between generated and real data.
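
The iterative loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the three callables (`propose`, `generate`, `score`) are hypothetical stand-ins for an LLM wrapper, a T2I model, and a consistency metric such as DSG, and the defaults for `iterations` and `history_size` are arbitrary. The sketch also assumes that every candidate image is scored against the original user prompt, so that revisions cannot drift away from the user's intent.

```python
from typing import Callable, List, Tuple

ScoredPrompt = Tuple[str, float]


def optimize_prompt(
    user_prompt: str,
    propose: Callable[[str, List[ScoredPrompt]], List[str]],  # hypothetical LLM wrapper
    generate: Callable[[str], object],                        # hypothetical T2I wrapper
    score: Callable[[str, object], float],                    # hypothetical consistency metric
    iterations: int = 5,      # assumed budget, not taken from the paper
    history_size: int = 5,    # assumed number of past prompts shown to the LLM
) -> ScoredPrompt:
    """Return the best-scoring prompt found and its consistency score."""
    # Score the user's own prompt first; a revision is kept only if it beats this.
    best_prompt = user_prompt
    best_score = score(user_prompt, generate(user_prompt))
    history: List[ScoredPrompt] = [(best_prompt, best_score)]

    for _ in range(iterations):
        # Show the proposer the highest-scoring prompts seen so far.
        top = sorted(history, key=lambda pair: pair[1], reverse=True)[:history_size]
        for candidate in propose(user_prompt, top):
            # Assumption: consistency is measured against the ORIGINAL prompt.
            candidate_score = score(user_prompt, generate(candidate))
            history.append((candidate, candidate_score))
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score

    return best_prompt, best_score
```

Because the T2I model is only called through `generate`, this kind of loop needs no fine-tuning or access to model weights. The cost is one image generation and one scoring call per candidate prompt.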

Optimizing Prompts for Text-to-Image Generation

TIPO: Text to Image with Text Presampling for Prompt Optimization

Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models

Dynamic Prompt Optimizing for Text-to-Image Generation

Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation

Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)

What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance

Batch-Instructed Gradient for Prompt Evolution: Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis

NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation

PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement

PromptCoT: Align Prompt Distribution Via Adapted Chain-of-Thought

If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection

Prompt Optimizer of Text-to-Image Diffusion Models for Abstract Concept Understanding

Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models

PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation

Universal Prompt Optimizer for Safe Text-to-Image Generation

Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis

Optimizing Negative Prompts for Enhanced Aesthetics and Fidelity in Text-To-Image Generation

Optimizing Prompts Using In-Context Few-Shot Learning for Text-to-Image Generative Models