In-Context Learning Unlocked for Diffusion Models

Zhendong Wang, Yifan Jiang, Yadong Lu, Yelong Shen, Pengcheng He, Weizhu Chen, Zhangyang Wang, Mingyuan Zhou
2023-10-19
Abstract: We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. Given a pair of task-specific example images, such as depth from/to image and scribble from/to image, and a text guidance, our model automatically understands the underlying task and performs the same task on a new query image following the text guidance. To achieve this, we propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input. The diffusion model is trained jointly over six different tasks using these prompts. The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation on the trained tasks and generalizes effectively to new, unseen vision tasks with their respective prompts. Our model also shows compelling text-guided image editing results. Our framework aims to facilitate research into in-context learning for computer vision. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Prompt-Diffusion.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to give diffusion models in-context learning ability. The authors propose Prompt Diffusion, a framework in which a diffusion-based generative model receives a pair of example images for a specific task, a new query image, and text guidance. From the example pair the model infers the underlying task, then performs that task on the query image while following the text guidance. The approach works on the task types seen in training and also generalizes to new, unseen vision tasks. The main contributions of the paper are:
1. A novel vision-language prompt design that can represent a wide range of vision-language tasks.
2. The Prompt Diffusion model, the first diffusion-based vision-language foundation model capable of in-context learning.
3. A demonstration of high-quality in-context generation on both the trained tasks and new tasks.
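To make the prompt structure concrete, the sketch below groups the four inputs described above (an example image pair, a query image, and text guidance) into one container. It is only an illustration: the names `VisionLanguagePrompt`, `build_prompt`, and `shape` are hypothetical and do not come from the authors' repository, images are stood in for by nested lists, and the shared-resolution check is an assumption of this sketch rather than a stated requirement of the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical stand-in for an image tensor: a nested list of pixel rows.
Image = List[List[float]]


@dataclass(frozen=True)
class VisionLanguagePrompt:
    """Illustrative container only, not the authors' data structure."""
    example_source: Image  # e.g. a depth map
    example_target: Image  # e.g. the image paired with that depth map
    query: Image           # new input the inferred task is applied to
    text: str              # text guidance for the generated output


def shape(image: Image) -> Tuple[int, int]:
    """Return (height, width) of a nested-list image."""
    return (len(image), len(image[0]) if image else 0)


def build_prompt(example_source: Image, example_target: Image,
                 query: Image, text: str) -> VisionLanguagePrompt:
    # Assumption of this sketch: all three images share one resolution.
    shapes = {shape(example_source), shape(example_target), shape(query)}
    if len(shapes) != 1:
        raise ValueError("example pair and query must have the same shape")
    return VisionLanguagePrompt(example_source, example_target, query, text)


# Toy depth-to-image prompt with 2x2 placeholder images.
depth_map = [[0.1, 0.9], [0.4, 0.6]]
photo = [[0.3, 0.2], [0.8, 0.5]]
new_depth_map = [[0.7, 0.7], [0.2, 0.1]]
prompt = build_prompt(depth_map, photo, new_depth_map,
                      "a cabin in a snowy forest")
```

Swapping the roles of `depth_map` and `photo` in the example pair would describe the inverse task (image to depth) with the same container, which is the sense in which a single prompt format can cover several tasks.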