Abstract: This paper presents a versatile image-to-image visual assistant, PixWizard, designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we comprehensively include a large set of diverse vision tasks such as text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions. The code and related resources are available at https://github.com/AFeng-x/PixWizard

What problem does this paper attempt to address?

The paper sets out to build a general-purpose, interactive image-to-image visual assistant, PixWizard, that performs image generation, manipulation, and translation tasks according to free-form language instructions. Specifically, PixWizard aims to:

1. **Unified Task Processing**: Integrate various visual tasks (text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, etc.) into a unified image-text-to-image generation framework, so that a single model can handle diverse visual tasks.
2. **Dataset Construction**: Create a comprehensive Omni Pixel-to-Pixel Instruction-Tuning Dataset, which contains 30 million training samples and covers five main capabilities: image generation, image editing, image restoration, image grounding, and dense image prediction. This helps to improve the generalization ability of the model across tasks.
3. **Architecture Design**: Adopt an architecture based on the Diffusion Transformer (DiT) to enhance the flexibility and stability of the model. A dynamic partitioning and padding scheme lets the model handle input images of arbitrary resolution, and a structure-aware and semantic-aware guidance mechanism helps the model better understand multimodal instructions (images plus natural language instructions).
4. **Open-Language Instruction Support**: Enhance the model's ability to understand free-form user prompts, enabling it to perform various image operations according to natural language instructions. Instructions are written manually and then expanded with GPT-4 into a large number of variants to keep the instruction set diverse and accurate.
5. **Two-Stage Training Strategy**: To make full use of the limited dataset, a two-stage training and data-balancing strategy is proposed. In the first stage, the task weights of small datasets are increased to balance the data volume; in the second stage, all data are combined for further training to improve the overall performance of the model.

Through these methods, PixWizard demonstrates strong image generation and understanding abilities and generalizes well to unseen tasks and human instructions, making it a capable interactive image-to-image visual assistant.

### Involved Formulas

The loss function mentioned in the paper is:

\[ L=\mathbb{E}_{t, p_1(x_1), p_t(x_t|x_1), c_I, c_T}\left\|v_\theta(x_t, t, c_I, c_T)-u_t(x_t, t|x_1)\right\|^2 \]

where \(v_\theta\) is the predicted velocity field, \(u_t\) is the true velocity field conditioned on the sample \(x_1\) drawn from \(p_1\), and \(c_I\) and \(c_T\) are the image condition and the text-instruction condition, respectively.

In addition, the final output of the attention mechanism is:

\[ A = \text{softmax}\left(\frac{\tilde{Q}_i\tilde{K}_i^T}{\sqrt{d}}\right)V_i+\tanh(\alpha_t)\text{softmax}\left(\frac{\tilde{Q}_iK_t^T}{\sqrt{d}}\right)V_t+\tanh(\alpha_{ci})\text{softmax}\left(\frac{\tilde{Q}_iK_{ci}^T}{\sqrt{d}}\right)V_{ci} \]

where \(\tilde{Q}_i\) and \(\tilde{K}_i\) are the query and key after applying RoPE position encoding, \(d\) is the dimension of the query and key, and \(\alpha_t\) and \(\alpha_{ci}\) are zero-initialized learnable parameters. Because \(\tanh(0)=0\), the text and image-condition branches contribute nothing at the start of training, and their influence is learned gradually through the gates.
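The gated attention formula above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `attend` and `gated_attention` are hypothetical, the code handles a single head with no batching, and it assumes RoPE has already been applied to the image queries and keys passed in. It only shows how the three softmax branches are combined through the `tanh` gates.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

def gated_attention(Q_i, K_i, V_i, K_t, V_t, K_ci, V_ci,
                    alpha_t=0.0, alpha_ci=0.0):
    """Image self-attention plus tanh-gated text and image-condition branches.

    Q_i and K_i are assumed to be already RoPE-encoded (an assumption of
    this sketch). alpha_t and alpha_ci default to 0, which mirrors the
    zero initialization described in the text.
    """
    base = attend(Q_i, K_i, V_i)    # image tokens attending to image tokens
    text = attend(Q_i, K_t, V_t)    # image tokens attending to text tokens
    cond = attend(Q_i, K_ci, V_ci)  # image tokens attending to condition image
    g_t, g_ci = math.tanh(alpha_t), math.tanh(alpha_ci)
    return [[b + g_t * t + g_ci * c for b, t, c in zip(rb, rt, rc)]
            for rb, rt, rc in zip(base, text, cond)]
```

With both gates at their initial value of zero, `gated_attention` returns exactly the plain image self-attention output, so the extra branches only begin to influence the result once training moves `alpha_t` and `alpha_ci` away from zero.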

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
