Abstract: Generative artificial intelligence (GenAI) has made significant progress in understanding world knowledge and generating content from human languages across various modalities, such as text-to-text large language models, text-to-image Stable Diffusion, and text-to-video Sora. In this paper, we investigate the capability of GenAI for text-to-model generation, to see whether GenAI can comprehend hyper-level knowledge embedded within the parameters of AI models themselves. Specifically, we study a practical scenario termed train-once-for-all personalization, aiming to generate personalized models for diverse end-users and tasks using text prompts. Inspired by the recent emergence of neural network diffusion, we present Tina, a text-conditioned neural network diffusion for train-once-for-all personalization. Tina leverages a diffusion transformer model conditioned on task descriptions embedded using a CLIP model. Despite the astronomical number of potential personalized tasks (e.g., $1.73\times10^{13}$), by our design, Tina demonstrates remarkable in-distribution and out-of-distribution generalization even when trained on small datasets ($\sim 1000$). We further verify whether and how Tina understands world knowledge by analyzing its capabilities under zero-shot/few-shot image prompts, different numbers of personalized classes, prompts of natural language descriptions, and predicting unseen entities.
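To give a sense of where a task count on the order of $1.73\times10^{13}$ can come from, the sketch below counts the ways to pick a fixed-size subset of classes. The setting of 10 classes chosen out of 100 is an assumption made here for illustration (the abstract does not state how the figure was derived), and `count_tasks` is a hypothetical helper name.

```python
import math

# Assumption (not stated in the abstract): each personalized task is a
# choice of 10 classes out of a 100-class label space.
NUM_CLASSES = 100
CLASSES_PER_TASK = 10


def count_tasks(num_classes: int, classes_per_task: int) -> int:
    """Number of distinct class subsets of a fixed size (binomial coefficient)."""
    return math.comb(num_classes, classes_per_task)


total = count_tasks(NUM_CLASSES, CLASSES_PER_TASK)
print(f"{total:.3e}")  # prints 1.731e+13
```

Under this assumption, a training set of roughly 1000 tasks covers only a vanishing fraction of the task space, which is why generalization to unseen tasks is the central question.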

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses how to use Generative Artificial Intelligence (GenAI) to generate personalized model parameters from text prompts, thereby achieving train-once-for-all personalization. Specifically, the researchers explore whether GenAI can understand high-level knowledge embedded in AI model parameters and generate personalized models for different user and task needs through text-conditioned neural network diffusion.

### Main Contributions

1. **Exploring the Potential of GenAI in Generating Personalized Models**:
   - The researchers propose the concept of text-to-model generation, which directly generates model parameters from text prompts to meet the personalized needs of different users.
   - This is the first study to apply text prompts as conditions in neural network diffusion.
2. **Proposing the Tina Framework**:
   - Tina is a text-conditioned neural network diffusion model for train-once-for-all personalization.
   - Tina can be trained on small datasets and generalize to unseen tasks and entities (categories).
3. **Analyzing Tina's Capabilities and Boundaries**:
   - Experiments validate Tina's performance on classification tasks with varying numbers of classes and under zero-shot and few-shot image prompts.
   - The study explores Tina's understanding of world knowledge, including natural language descriptions and predictions of unseen entities.

### Method Overview

- **Problem Definition**:
  - Task $k$ is defined as a classification task on a class subset $Y_k$.
  - The goal is to learn a neural network predictor $f_{\theta_k}$ with parameters $\theta_k$.
  - Given a task description $t_k$, the model generates the parameters $\theta_k$.
- **Framework**:
  - Tina combines a Diffusion Transformer (DiT) and a CLIP encoder to generate personalized models from text prompts.
  - During training, the CLIP text encoder embeds the task description, and noised parameters are denoised step by step, conditioned on that embedding, to restore the original parameter distribution.
  - During inference, Tina can also accept image prompts, using the CLIP image encoder for generation.
- **Data Preparation**:
  - Training data is prepared in two stages: first, a general model is trained on a large-scale dataset; then the general model is fine-tuned on each personalized task to produce personalized models (p-Models).
  - Each data sample is a pair of the form "(task description, p-Model)".

### Experimental Results

- **Performance on Different Datasets and Model Architectures**:
  - Experiments were conducted on the Mini-ImageNet, CIFAR-100, and Caltech-101 datasets.
  - Results show that Tina significantly outperforms baseline methods (such as general models, classifier selection, and TAPER-Mixer) in both in-distribution and out-of-distribution personalization tasks.
- **Parameter Inheritance**:
  - Inheriting parameters from the pre-trained model G.pt helps Tina converge faster, although the final performance is similar.
- **Image Prompt Training**:
  - Tina was also trained with image prompts via the CLIP image encoder; the text-prompted variant converges faster, although the final performance is similar.
- **Capability Analysis of Different Prompt Schemes**:
  - Tina's performance was validated under zero-shot and few-shot image prompts, different numbers of personalized categories, and natural language description prompts.

### Conclusion

Tina demonstrates great potential in the field of text-to-model generation, especially in train-once-for-all personalization tasks. Through text prompts, Tina can generate personalized models for different user and task needs, and it performs well even when trained on small datasets. Future research can further explore Tina's applications in more complex and challenging scenarios.
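The training step described under Method Overview (noise the p-Model parameters, then learn to denoise them conditioned on the task embedding) can be illustrated with a minimal, standard-library-only sketch. It uses the common noise-prediction objective with a linear noise schedule, which is an assumption: the summary does not say which objective or schedule Tina uses. All names (`make_alpha_bars`, `add_noise`, `denoising_loss`, `zero_predictor`) are hypothetical, the parameter vector and text embedding are toy values, and the predictor is a placeholder for the DiT.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(theta, alpha_bar, rng):
    """Forward process: theta_t = sqrt(a) * theta + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in theta]
    noisy = [
        math.sqrt(alpha_bar) * w + math.sqrt(1.0 - alpha_bar) * e
        for w, e in zip(theta, eps)
    ]
    return noisy, eps


def denoising_loss(predict_noise, theta, text_embedding, t, alpha_bars, rng):
    """Mean squared error between the predicted and the true injected noise."""
    noisy, eps = add_noise(theta, alpha_bars[t], rng)
    eps_hat = predict_noise(noisy, t, text_embedding)
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps)) / len(eps)


def zero_predictor(noisy_theta, t, text_embedding):
    """Placeholder for the conditioned denoiser; always predicts zero noise."""
    return [0.0 for _ in noisy_theta]


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
theta = [0.5, -1.2, 0.3, 0.8]        # toy flattened p-Model parameters
text_embedding = [0.1, 0.2, 0.3]     # toy stand-in for a CLIP text embedding
loss = denoising_loss(zero_predictor, theta, text_embedding, 500, alpha_bars, rng)
print(loss)
```

In a real system the placeholder would be replaced by a trainable network and the loss minimized over many "(task description, p-Model)" pairs; at inference, sampling would start from pure noise and apply the learned denoiser step by step under the prompt embedding.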

Text-to-Model: Text-Conditioned Neural Network Diffusion for Train-Once-for-All Personalization
