Abstract: Despite the significant advancements in text-to-image (T2I) generative models, users often face a trial-and-error challenge in practical scenarios. This challenge arises from the complexity and uncertainty of tedious steps such as crafting suitable prompts, selecting appropriate models, and configuring specific arguments, making users resort to labor-intensive attempts for desired images. This paper proposes Automatic T2I generation, which aims to automate these tedious steps, allowing users to simply describe their needs in a freestyle chatting way. To systematically study this problem, we first introduce ChatGenBench, a novel benchmark designed for Automatic T2I. It features high-quality paired data with diverse freestyle inputs, enabling comprehensive evaluation of automatic T2I models across all steps. Additionally, recognizing Automatic T2I as a complex multi-step reasoning task, we propose ChatGen-Evo, a multi-stage evolution strategy that progressively equips models with essential automation skills. Through extensive evaluation across step-wise accuracy and image quality, ChatGen-Evo significantly enhances performance over various baselines. Our evaluation also uncovers valuable insights for advancing automatic T2I. All our data, code, and models will be available at https://chengyou-jia.github.io/ChatGen-Home

### What problem does this paper attempt to solve?

This paper addresses the difficulties users face when applying text-to-image (T2I) generation models in practice. To get a usable image, users typically go through a cumbersome trial-and-error process with three steps:

1. **Writing appropriate prompts**: Users must carefully craft a prompt that describes the image they want to generate.
2. **Selecting an appropriate model**: Users must pick the model best suited to the current need from the many available T2I models.
3. **Configuring specific arguments**: Users must set suitable parameters for the selected model to obtain the best generation results.

These steps are complex and uncertain, so non-professional users struggle to produce the desired images, much like "a mouse in a maze". To simplify the process, the paper proposes **Automatic Text-to-Image (Automatic T2I)** generation: users describe their needs in natural conversation, and the system automatically produces the required images.

### Main contributions of the paper

1. **Posing a new, challenging problem**: Develop an automatic T2I model that handles users' freestyle conversational inputs and automatically generates all necessary components (prompts, models, and arguments).
2. **Introducing the ChatGenBench benchmark**: A benchmark dataset designed specifically for automatic T2I. It contains a large amount of high-quality paired data, supports multimodal and historical inputs, and enables step-by-step evaluation of automatic T2I models.
3. **Proposing the ChatGen-Evo framework**: A multi-stage evolution strategy for training multimodal large language models (MLLMs). The task is decomposed into stages so that the model progressively acquires the skills needed for automation.
4. **Extensive experimental verification**: A comprehensive evaluation on ChatGenBench shows that ChatGen-Evo outperforms various baselines on step-wise accuracy and image quality, and yields insights that point to directions for further work on automatic T2I.

Through these contributions, the paper addresses the limitation that existing methods automate only part of the T2I process, and it reports improved step-wise accuracy and image quality for automatic T2I.
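The three components an automatic T2I system has to produce (prompt, model, arguments) can be pictured as one structured output that is checked step by step against a reference. The Python sketch below is only an illustration of that idea: the `GenerationPlan` class, the `stepwise_match` function, the example model name, and the scoring rule are all hypothetical assumptions, not the data format or metrics used by ChatGenBench or ChatGen-Evo.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationPlan:
    """Hypothetical container for the three outputs of an automatic T2I system."""
    prompt: str                                    # generation-ready prompt
    model: str                                     # name of the selected T2I model
    arguments: dict = field(default_factory=dict)  # e.g. steps, guidance scale


def stepwise_match(pred: GenerationPlan, ref: GenerationPlan) -> dict:
    """Compare a predicted plan with a reference plan, one step at a time.

    Illustrative scoring only (an assumption), not the benchmark's metrics.
    """
    # Model selection: exact match on the model name.
    model_ok = pred.model == ref.model

    # Argument configuration: fraction of reference arguments reproduced exactly.
    if ref.arguments:
        hits = sum(
            1
            for key, value in ref.arguments.items()
            if key in pred.arguments and pred.arguments[key] == value
        )
        arg_score = hits / len(ref.arguments)
    else:
        arg_score = 1.0  # nothing to configure, so nothing can be wrong

    return {"model_match": model_ok, "argument_match": arg_score}


# Usage with made-up values: same model, one of two arguments correct.
ref = GenerationPlan("a watercolor fox, soft light", "model-a",
                     {"steps": 30, "cfg_scale": 7.0})
pred = GenerationPlan("watercolor painting of a fox", "model-a",
                      {"steps": 30, "cfg_scale": 5.0})
print(stepwise_match(pred, ref))
```

Prompt quality is deliberately left out of this sketch, since comparing free-text prompts needs a similarity measure rather than an exact match.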

ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge.

Diversified text-to-image generation via deep mutual information estimation

Emage: Non-Autoregressive Text-to-Image Generation

Customization Assistant for Text-to-image Generation

DiffChat: Learning to Chat with Text-to-Image Synthesis Models for Interactive Image Creation

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

RealtimeGen: an Intervenable AI Image Generation System for Commercial Digital Art Asset Creators

Text-to-Image Synthesis: A Decade Survey

AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation

Interactive Data Synthesis for Systematic Vision Adaptation via LLMs-AIGCs Collaboration

Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis

Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models

RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model

Improving face generation quality and prompt following with synthetic captions

Teaching Text-to-Image Models to Communicate.

Unified Text-to-Image Generation and Retrieval

TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Imaginique Expressions: Tailoring Personalized Short-Text-to-Image Generation Through Aesthetic Assessment and Human Insights