PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Weifeng Lin,Xinyu Wei,Renrui Zhang,Le Zhuo,Shitian Zhao,Siyuan Huang,Junlin Xie,Yu Qiao,Peng Gao,Hongsheng Li
2024-10-06
Abstract: This paper presents a versatile image-to-image visual assistant, PixWizard, designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we comprehensively include a large set of diverse vision tasks such as text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions. The code and related resources are available at https://github.com/AFeng-x/PixWizard
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to build a general-purpose, interactive image-to-image visual assistant, PixWizard, that performs a range of image generation, manipulation, and translation tasks from free-form language instructions. Specifically, PixWizard aims to provide:

1. **Unified Task Processing**: Integrate various vision tasks (text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and others) into a unified image-text-to-image generation framework, so that a single model can handle all of them.
2. **Dataset Construction**: Create a comprehensive Omni Pixel-to-Pixel Instruction-Tuning Dataset containing 30 million training instances and covering five main capabilities: image generation, image editing, image restoration, image grounding, and dense image prediction. This breadth helps the model generalize across tasks.
3. **Architecture Design**: Adopt an architecture based on the Diffusion Transformer (DiT) to improve the flexibility and stability of the model. A dynamic partitioning and padding scheme lets the model handle input images of arbitrary resolution, and structure-aware and semantic-aware guidance mechanisms help it interpret multi-modal instructions (images together with natural-language instructions).
4. **Open-Language Instruction Support**: Strengthen the model's ability to understand free-form user prompts so that it can carry out image operations from natural-language instructions. Instructions are written manually and then expanded with GPT-4 into a large number of variants to keep the instruction set diverse and accurate.
5. **Two-Stage Training Strategy**: To make full use of the limited dataset, a two-stage training and data-balancing strategy is proposed. In the first stage, the task weights of small datasets are increased to balance the data volume; in the second stage, all data are combined for further training to improve overall performance.

With these components, PixWizard demonstrates strong image generation and understanding abilities and generalizes to unseen tasks and human instructions, making it a capable interactive image-to-image visual assistant.

### Involved Formulas

The loss function mentioned in the paper is:

\[ L=\mathbb{E}_{t, p_1(x_1), p_t(x_t|x_1), c_I, c_T}\left\|v_\theta(x_t, t, c_I, c_T)-u_t(x_t, t|x_1)\right\|^2 \]

where \(v_\theta\) is the predicted velocity field, \(u_t\) is the target velocity field conditioned on the image sample \(x_1\), and \(c_I\) and \(c_T\) are the image condition and the text-instruction condition, respectively.

In addition, the final output of the attention mechanism is:

\[ A = \text{softmax}\left(\frac{\tilde{Q}_i\tilde{K}_i^T}{\sqrt{d}}\right)V_i+\tanh(\alpha_t)\text{softmax}\left(\frac{\tilde{Q}_iK_t^T}{\sqrt{d}}\right)V_t+\tanh(\alpha_{ci})\text{softmax}\left(\frac{\tilde{Q}_iK_{ci}^T}{\sqrt{d}}\right)V_{ci} \]

where \(\tilde{Q}_i\) and \(\tilde{K}_i\) are the image query and key after RoPE position encoding is applied, \(d\) is the dimension of the query and key, and \(\alpha_t\) and \(\alpha_{ci}\) are zero-initialized learnable parameters. Because \(\tanh(0)=0\), the second and third terms (attention over the keys \(K_t\) and \(K_{ci}\)) contribute nothing at initialization, and their influence grows only as the gates are learned.
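
The two formulas above can be made concrete with a minimal pure-Python sketch. It is not the authors' implementation. The linear path \(x_t = t\,x_1 + (1-t)\,x_0\), which gives the target velocity \(x_1 - x_0\), is an assumption that this summary does not state, and the function names `flow_matching_loss`, `attend`, and `gated_attention` are hypothetical. Queries and image keys are assumed to have RoPE applied already, and vectors are plain lists of floats.

```python
import math


def flow_matching_loss(v_theta, x0, x1, t, c_I=None, c_T=None):
    """Single-sample squared error ||v_theta(x_t, t, c_I, c_T) - u_t||^2.

    ASSUMPTION: linear interpolation path x_t = t*x1 + (1-t)*x0,
    so the target velocity is u_t = x1 - x0 (x0 is a noise sample).
    """
    x_t = [t * b + (1.0 - t) * a for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = v_theta(x_t, t, c_I, c_T)
    return sum((p - u) ** 2 for p, u in zip(pred, target))


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, V))
                    for c in range(len(V[0]))])
    return out


def gated_attention(Q, K_i, V_i, K_t, V_t, K_ci, V_ci,
                    alpha_t=0.0, alpha_ci=0.0):
    """Image self-attention plus tanh-gated text and condition-image branches.

    alpha_t and alpha_ci default to 0.0, mirroring zero initialization:
    tanh(0) = 0, so both extra branches start switched off.
    """
    a_img = attend(Q, K_i, V_i)
    a_txt = attend(Q, K_t, V_t)
    a_cond = attend(Q, K_ci, V_ci)
    g_t, g_ci = math.tanh(alpha_t), math.tanh(alpha_ci)
    return [[s + g_t * t + g_ci * c for s, t, c in zip(rs, rt, rc)]
            for rs, rt, rc in zip(a_img, a_txt, a_cond)]
```

With both gates at their zero initial value, `gated_attention` returns exactly the plain image self-attention output, so the text and condition-image branches cannot disturb the backbone at the start of training. Raising `alpha_t` or `alpha_ci` blends those branches in with weights bounded to the interval (-1, 1).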