Abstract: Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these advances, the application of text-to-image generative models to standard visual recognition tasks in computer vision remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task-specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually encoded task output. To train our model, we pool commonly used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification. We then use a large language model to paraphrase prompt templates that convey the specific task to be conducted on each image. Through this process, we create a multi-modal and multi-task training dataset comprising input and output images along with annotated instructions. Following the InstructPix2Pix architecture, we apply instruction-tuning to a text-to-image diffusion model using our constructed dataset, steering its functionality from a generative model to an instruction-guided multi-task vision learner. Experiments demonstrate that our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions.
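
The dataset construction described in the abstract (pairing each input image with a paraphrased task instruction and a visually encoded output image) can be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: the template strings, the `Example` record, and the `build_example` helper are all hypothetical, and the LLM paraphrasing step is stood in for by a small hard-coded list of paraphrases.

```python
import random
from dataclasses import dataclass

# Hypothetical instruction templates. The abstract does not list the actual
# prompts; in the paper these paraphrases come from a large language model.
TEMPLATES = {
    "segmentation": [
        "Segment the {category}.",
        "Mark every pixel that belongs to the {category}.",
    ],
    "detection": [
        "Detect the {category}.",
        "Draw a box around each {category}.",
    ],
    "depth": ["Estimate the depth map of this image."],
    "classification": ["Is there a {category} in this image?"],
}


@dataclass(frozen=True)
class Example:
    """One training triple: instruction text plus input and target images."""
    task: str
    instruction: str
    input_image: str   # path to the source image
    output_image: str  # path to the visually encoded task output


def build_example(task, input_image, output_image, category=None, rng=None):
    """Pick one paraphrased template for the task and fill in the category."""
    rng = rng or random
    template = rng.choice(TEMPLATES[task])
    if "{category}" in template:
        if category is None:
            raise ValueError(f"task {task!r} needs a category")
        template = template.format(category=category)
    return Example(task, template, input_image, output_image)
```

For instance, `build_example("segmentation", "img/001.jpg", "mask/001.png", category="dog")` yields a record whose instruction is one of the segmentation paraphrases with "dog" filled in. Pooling such records across tasks gives a single multi-task set of (instruction, input image, output image) triples of the kind used for instruction-tuning; the file paths here are placeholders.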

Coherent Zero-Shot Visual Instruction Generation

Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge

Visually Dehallucinative Instruction Generation

Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks

Zero-shot Text-guided Infinite Image Synthesis with LLM guidance

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

Generating Illustrated Instructions

InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

Compositional Zero-shot Learning Via Progressive Language-based Observations

FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax

Instruct Pix-to-3D: Instructional 3D Object Generation from a Single Image

Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation

Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator

VGDIFFZERO: Text-To-Image Diffusion Models Can Be Zero-Shot Visual Grounders

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Visual-Semantic Aligned Bidirectional Network for Zero-Shot Learning

Instruct-Imagen: Image Generation with Multi-modal Instruction

Generative Visual Instruction Tuning

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models