Abstract: Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer vision for standard visual recognition tasks remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task-specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually-encoded task output. To train our model, we pool commonly-used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification. We then use a large language model to paraphrase prompt templates that convey the specific tasks to be conducted on each image, and through this process, we create a multi-modal and multi-task training dataset comprising input and output images along with annotated instructions. Following the InstructPix2Pix architecture, we apply instruction-tuning to a text-to-image diffusion model using our constructed dataset, steering its functionality from a generative model to an instruction-guided multi-task vision learner. Experiments demonstrate that our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions.
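To make the idea of a "visually-encoded task output" concrete, here is a minimal sketch of how a detection box or a segmentation mask could be rendered as a target image. The abstract does not specify the encoding, so the colors, the box-outline convention, and the helper names (`blank_image`, `encode_detection`, `encode_segmentation`) are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels, indexed as image[y][x]


def blank_image(width: int, height: int, color: Pixel = (0, 0, 0)) -> Image:
    """Create a width x height image filled with a single color."""
    return [[color for _ in range(width)] for _ in range(height)]


def encode_detection(
    image: Image,
    box: Tuple[int, int, int, int],
    color: Pixel = (255, 0, 0),
) -> Image:
    """Return a copy of `image` with the outline of `box` drawn on it.

    `box` is (x0, y0, x1, y1) with inclusive pixel coordinates.
    Assumption: a detection is encoded as a colored box outline.
    """
    out = [row[:] for row in image]  # copy so the input stays untouched
    x0, y0, x1, y1 = box
    for x in range(x0, x1 + 1):  # top and bottom edges
        out[y0][x] = color
        out[y1][x] = color
    for y in range(y0, y1 + 1):  # left and right edges
        out[y][x0] = color
        out[y][x1] = color
    return out


def encode_segmentation(mask: List[List[bool]]) -> Image:
    """Render a binary mask as a white-on-black image (assumed encoding)."""
    white, black = (255, 255, 255), (0, 0, 0)
    return [[white if inside else black for inside in row] for row in mask]
```

Under this framing, a task output is just another image, so one image-generating model can in principle be trained to produce the outputs of several tasks.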

What problem does this paper attempt to address?

The paper addresses the limited use of generative text-to-image models in standard visual recognition tasks. Although these models have made remarkable progress in generating realistic and diverse images, they have not been fully exploited as a basis for computer vision tasks. Existing methods usually design a specialized model architecture and loss function for each task, which limits generalization across problem domains and datasets.

To overcome this limitation, the authors propose InstructCV, a unified language interface for computer vision tasks. By casting multiple computer vision tasks as text-to-image generation problems, InstructCV executes tasks by following natural language instructions: the text is an instruction describing the task, and the generated image is the visually encoded task output. This abstracts away task-specific design choices and improves the model's generalization to new data, categories, and user instructions.

The key contributions of the paper include:

1. **Constructing a multi-modal, multi-task instruction-tuning dataset**: The authors pool several commonly used computer vision datasets covering segmentation, object detection, depth estimation, and classification. A large language model (LLM) paraphrases the prompt templates, yielding a multi-modal, multi-task training dataset of input images, output images, and annotated instructions.
2. **Instruction-tuning based on the InstructPix2Pix architecture**: Using the constructed dataset, a pre-trained conditional diffusion model (such as Stable Diffusion) is instruction-tuned, steering it from a generative model to an instruction-guided multi-task vision learner.
3. **Experimental verification**: Experiments show that InstructCV performs competitively with other generalist and task-specific vision models and generalizes well to unseen data, categories, and user instructions.

In conclusion, the paper applies generative text-to-image models to multiple computer vision tasks through natural language instructions, improving the model's generalization ability and practicality.
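The dataset construction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the template wordings, the `TEMPLATES` table, the `Sample` record, and `build_sample` are hypothetical, and the hand-written variants merely stand in for the LLM-generated paraphrases mentioned in the summary.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical prompt templates per task. In the paper, paraphrases come from
# an LLM; the fixed variants here are illustrative stand-ins.
TEMPLATES: Dict[str, List[str]] = {
    "segmentation": [
        "Segment the {category}.",
        "Create a mask for the {category} in this image.",
    ],
    "detection": [
        "Detect the {category}.",
        "Draw a box around every {category}.",
    ],
    "depth": [
        "Estimate the depth map of this image.",
        "Show how far each pixel is from the camera.",
    ],
    "classification": [
        "Is there a {category} in this image?",
        "Tell me whether a {category} appears here.",
    ],
}


@dataclass(frozen=True)
class Sample:
    """One instruction-tuning example: (input image, instruction, target image)."""
    input_image: str   # path to the source image
    instruction: str   # natural language task description
    target_image: str  # path to the visually encoded task output


def build_sample(
    task: str,
    input_image: str,
    target_image: str,
    category: str = "",
    rng: Optional[random.Random] = None,
) -> Sample:
    """Pick a random template for `task` and fill in the category name."""
    if rng is None:
        rng = random.Random()
    template = rng.choice(TEMPLATES[task])
    # str.format ignores unused keyword arguments, so templates without a
    # "{category}" placeholder (e.g. depth) pass through unchanged.
    return Sample(input_image, template.format(category=category), target_image)


# Usage: file names are placeholders, not real dataset paths.
sample = build_sample(
    "detection", "img_001.jpg", "img_001_boxes.png",
    category="dog", rng=random.Random(0),
)
```

Sampling among several phrasings of the same task is one simple way to expose a model to varied instructions, which is in the spirit of the generalization to unseen user instructions that the paper reports.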

InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

Sketch-Guided Text-to-Image Diffusion Models

Are Diffusion Models Vision-And-Language Reasoners?

Improving Diffusion Models for Scene Text Editing with Dual Encoders

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

Instruct Pix-to-3D: Instructional 3D Object Generation from a Single Image

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

Implementing and Experimenting with Diffusion Models for Text-to-Image Generation

Unleashing Text-to-Image Diffusion Models for Visual Perception

TextCraftor: Your Text Encoder Can be Image Quality Controller

Exploring Vision Transformers as Diffusion Learners

Pix2Video: Video Editing using Image Diffusion

Conditional Text-to-Image Generation with Reference Guidance

SEGA: Instructing Text-to-Image Models using Semantic Guidance

Text-to-image Diffusion Models in Generative AI: A Survey

Diffusion Self-Distillation for Zero-Shot Customized Image Generation