Painter: Teaching Auto-regressive Language Models to Draw Sketches

Reza Pourreza, Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Pulkit Madan, Roland Memisevic
DOI: https://doi.org/10.48550/arXiv.2308.08520
2023-08-17
Abstract: Large language models (LLMs) have made tremendous progress in natural language understanding and they have also been successfully adopted in other domains such as computer vision, robotics, reinforcement learning, etc. In this work, we apply LLMs to image generation tasks by directly generating the virtual brush strokes to paint an image. We present Painter, an LLM that can convert user prompts in text description format to sketches by generating the corresponding brush strokes in an auto-regressive way. We construct Painter based on off-the-shelf LLM that is pre-trained on a large text corpus, by fine-tuning it on the new task while preserving language understanding capabilities. We create a dataset of diverse multi-object sketches paired with textual prompts that covers several object types and tasks. Painter can generate sketches from text descriptions, remove objects from canvas, and detect and classify objects in sketches. Although this is an unprecedented pioneering work in using LLMs for auto-regressive image generation, the results are very encouraging.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is how to apply large language models (LLMs) to image generation, specifically by drawing an image as a sequence of virtual brush strokes. The paper introduces Painter, an LLM-based system that generates sketches from text descriptions. Unlike existing image generation methods, Painter imitates the way humans paint: it completes an image by autoregressively generating a series of strokes.

### Main Problems

1. **Applying LLMs to Image Generation**: Although existing image generation methods have achieved remarkable results, they lack interpretability and their inherent flaws are hard to fix. Painter offers a different approach: an LLM generates the strokes that draw the image, which is closer to the process of human painting.
2. **Multi-object Sketch Generation**: Existing datasets such as Quick-Draw contain only single-object sketches and lack detailed text descriptions. The paper creates a new dataset, Multi-Object-Quick-Draw, which contains multi-object sketches with detailed relationship and position labels, in order to train Painter to generate more complex multi-object sketches.
3. **Multi-task Ability**: Besides generating sketches, Painter can perform other tasks, such as completing incomplete sketches, removing objects from the canvas, reproducing given sketches, and detecting and classifying objects in sketches. These tasks are intended to improve the model's performance on the main task and to increase its versatility.

### Solutions

1. **Dataset Construction**: The authors create the Multi-Object-Quick-Draw dataset, which contains diverse multi-object sketches and their corresponding text descriptions. The sketches cover single objects as well as multiple objects annotated with relationship and relative position labels.
2. **Model Design**: The existing pre-trained LLM is modified by adding residual cross-attention layers so that it can handle interleaved text and image inputs. A visual feedback loop is also introduced, enabling the model to monitor the state of the canvas in real time during generation.
3. **Training Method**: Training is supervised with the standard masked cross-entropy loss, so that the model learns to understand the text description and generate the corresponding strokes.

### Contributions

1. **First Use of LLMs for Autoregressive Image Generation**: The paper presents Painter as the first model to use LLMs for autoregressive image generation.
2. **Creation of a New Dataset**: The Multi-Object-Quick-Draw dataset contains diverse multi-object sketches with detailed relationship and position labels, providing a resource for training more complex image generation models.
3. **Enhanced Visual Grounding**: The visual feedback loop, cross-attention layers, and multi-task training improve the model's performance and interpretability on image generation tasks.

In conclusion, this paper addresses the challenges of applying LLMs to image generation by introducing the Painter model, and it shows the potential of this method for generating complex multi-object sketches.
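
The masked cross-entropy loss named under Training Method can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function name `masked_cross_entropy` and its arguments are hypothetical, and the summary does not say which positions are masked. The usage example assumes, for illustration only, that prompt tokens are masked out and only the stroke tokens the model should produce contribute to the loss.

```python
import math


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over the positions where mask is 1.

    logits:  list of per-position score lists (one score per vocabulary entry)
    targets: list of target token ids, one per position
    mask:    list of 0/1 flags; 1 marks positions that contribute to the loss
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, mask):
        if not keep:
            continue  # masked position (assumed here: a prompt token), no loss
        # Log of the softmax normalizer, with the max subtracted for stability.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Negative log-probability of the target token.
        total += log_norm - scores[target]
        count += 1
    # Average over unmasked positions only; return 0.0 if everything is masked.
    return total / count if count else 0.0


# Usage: three positions over a 3-token vocabulary; the first is masked out.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
targets = [0, 1, 1]
mask = [0, 1, 1]
loss = masked_cross_entropy(logits, targets, mask)
print(f"masked loss: {loss:.4f}")
```

Averaging over the unmasked positions only keeps the loss scale independent of how long the masked part of the sequence is.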