Abstract: Recent advancements in generation models have showcased remarkable capabilities in generating fantastic content. However, most of them are trained on proprietary high-quality data, and some models withhold their parameters and only provide accessible application programming interfaces (APIs), limiting their benefits for downstream tasks. To explore the feasibility of training a text-to-image generation model comparable to advanced models using publicly available resources, we introduce EvolveDirector. This framework interacts with advanced models through their public APIs to obtain text-image data pairs for training a base model. Our experiments with extensive data indicate that a model trained on data generated by an advanced model can approximate its generation capability. However, this requires large-scale samples of 10 million or more, which incurs significant expenses in time, computational resources, and especially the costs of calling fee-based APIs. To address this problem, we leverage pre-trained large vision-language models (VLMs) to guide the evolution of the base model. The VLM continuously evaluates the base model during training and dynamically updates and refines the training dataset through discrimination, expansion, deletion, and mutation operations. Experimental results show that this paradigm significantly reduces the required data volume. Furthermore, when approaching multiple advanced models, EvolveDirector can select the best samples generated by them to learn powerful and balanced abilities. The final trained model, Edgen, is demonstrated to outperform these advanced models. The code and model weights are available at https://github.com/showlab/EvolveDirector.
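
The dataset-refinement loop described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the function name `evolve_dataset`, the callables `base_wins`, `expand`, and `mutate`, and the `mutation_rate` parameter are all hypothetical, and the exact semantics given to each operation here (drop prompts the base model already handles, add variations of prompts it fails on, occasionally inject a fresh prompt) are assumptions made for illustration. In practice the VLM would supply the judgments that the stub callables stand in for.

```python
import random
from typing import Callable, List, Optional


def evolve_dataset(
    prompts: List[str],
    base_wins: Callable[[str], bool],
    expand: Callable[[str], List[str]],
    mutate: Callable[[], str],
    mutation_rate: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """One hypothetical refinement pass over the training prompts.

    base_wins(prompt) stands in for the VLM's discrimination step: it returns
    True when the base model's image is judged at least as good as the
    advanced model's image for that prompt (an assumed interface).
    """
    rng = rng or random.Random(0)
    updated: List[str] = []
    for prompt in prompts:
        # Discrimination: ask the (stubbed) judge about this prompt.
        if base_wins(prompt):
            # Deletion: the base model already handles this prompt, so drop it.
            continue
        # The prompt is still hard for the base model, so keep it.
        updated.append(prompt)
        # Expansion: add variations of the hard prompt.
        updated.extend(expand(prompt))
        # Mutation: occasionally add an unrelated new prompt.
        if rng.random() < mutation_rate:
            updated.append(mutate())
    return updated
```

With stub callables, a pass over `["a cat", "a dog"]` where the judge reports the base model as good enough on `"a cat"` keeps only `"a dog"` and its variations, so the dataset shrinks where the model has caught up and grows where it has not.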

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Emage: Non-Autoregressive Text-to-Image Generation

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Parallelized Autoregressive Visual Generation

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Randomized Autoregressive Visual Generation

Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction

XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

Many-to-many Image Generation with Auto-regressive Diffusion Models