ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Chengyou Jia, Changliang Xia, Zhuohang Dang, Weijia Wu, Hangwei Qian, Minnan Luo
2024-11-26
Abstract: Despite the significant advancements in text-to-image (T2I) generative models, users often face a trial-and-error challenge in practical scenarios. This challenge arises from the complexity and uncertainty of tedious steps such as crafting suitable prompts, selecting appropriate models, and configuring specific arguments, making users resort to labor-intensive attempts for desired images. This paper proposes Automatic T2I generation, which aims to automate these tedious steps, allowing users to simply describe their needs in a freestyle chatting way. To systematically study this problem, we first introduce ChatGenBench, a novel benchmark designed for Automatic T2I. It features high-quality paired data with diverse freestyle inputs, enabling comprehensive evaluation of automatic T2I models across all steps. Additionally, recognizing Automatic T2I as a complex multi-step reasoning task, we propose ChatGen-Evo, a multi-stage evolution strategy that progressively equips models with essential automation skills. Through extensive evaluation across step-wise accuracy and image quality, ChatGen-Evo significantly enhances performance over various baselines. Our evaluation also uncovers valuable insights for advancing automatic T2I. All our data, code, and models will be available at https://chengyou-jia.github.io/ChatGen-Home
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses the challenges that users face when applying text-to-image (T2I) generation models in practice. Using a T2I model often involves a cumbersome trial-and-error process with three steps:

1. **Writing appropriate prompts**: Users must carefully craft a prompt that describes the image they want to generate.
2. **Selecting an appropriate model**: Users must choose, from the many available T2I models, the one best suited to the current need.
3. **Configuring specific arguments**: Users must set suitable arguments for the selected model to obtain the best generation results.

These steps are complex and uncertain, which makes it difficult for non-professional users to obtain the desired images; the experience resembles that of "a mouse in a maze". To simplify this process, the paper proposes **Automatic Text-to-Image (Automatic T2I)** generation: users simply describe their needs in freestyle conversation, and the system automatically produces the required images.

### Main contributions of the paper

1. **Proposing a new, challenging problem**: Developing an automatic T2I model that handles users' freestyle chat inputs and automatically produces all necessary components (prompts, models, and arguments).
2. **Introducing the ChatGenBench benchmark**: A benchmark dataset designed specifically for automatic T2I. It contains a large amount of high-quality paired data, supports multimodal and history-based inputs, and enables step-wise evaluation of automatic T2I models.
3. **Proposing the ChatGen-Evo framework**: A multi-stage evolution strategy for training multimodal large language models (MLLMs). It decomposes the task into multiple stages and progressively equips the model with the necessary automation skills.
4. **Extensive experimental verification**: A comprehensive evaluation on ChatGenBench shows that ChatGen-Evo outperforms various baselines in step-wise accuracy and image quality, and it yields insights that point to directions for further work on automatic T2I.

Through these contributions, the paper addresses the limitation that existing methods automate the T2I process only partially, and it reports significant gains over the baselines in both step-wise accuracy and image quality.
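
To make the three automated components concrete, the following minimal Python sketch shows one way the output of an automatic T2I system could be represented and checked step by step. It is an illustration under stated assumptions, not the paper's implementation: the names `GenerationPlan` and `stepwise_match`, the model name `model_a`, the argument keys, and the scoring rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationPlan:
    """Hypothetical container for the three components to be automated."""
    prompt: str                                     # crafted prompt
    model: str                                      # selected T2I model
    arguments: dict = field(default_factory=dict)   # model-specific arguments


def stepwise_match(predicted: GenerationPlan, reference: GenerationPlan) -> dict:
    """Compare a predicted plan with a reference plan, one step at a time.

    Illustrative scoring only (an assumption); the benchmark's actual
    metrics may differ. Prompt quality is not scored here.
    """
    # Count reference arguments that the prediction reproduces exactly.
    correct = sum(
        1
        for key, value in reference.arguments.items()
        if key in predicted.arguments and predicted.arguments[key] == value
    )
    total = len(reference.arguments)
    return {
        "model_correct": predicted.model == reference.model,
        "argument_accuracy": correct / total if total else 1.0,
    }


# Placeholder data, invented for this example.
reference = GenerationPlan(
    prompt="a watercolor painting of a fox in a snowy forest",
    model="model_a",
    arguments={"steps": 30, "cfg_scale": 7.0},
)
predicted = GenerationPlan(
    prompt="watercolor fox, snowy forest, soft light",
    model="model_a",
    arguments={"steps": 30, "cfg_scale": 5.0},
)

result = stepwise_match(predicted, reference)
print(result)
```

Here the predicted plan selects the same model as the reference but matches only one of its two arguments, so the sketch reports a correct model choice and an argument accuracy of 0.5. Keeping the per-step scores separate mirrors the idea of evaluating each automated step individually rather than judging only the final image.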