MaskGIT: Masked Generative Image Transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman
DOI: https://doi.org/10.48550/arXiv.2202.04200
2022-02-09
Abstract: Generative transformers have experienced rapid popularity growth in the computer vision community in synthesizing high-fidelity and high-resolution images. The best generative transformer models so far, however, still treat an image naively as a sequence of tokens, and decode an image sequentially following the raster scan ordering (i.e. line-by-line). We find this strategy neither optimal nor efficient. This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder, which we term MaskGIT. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference time, the model begins with generating all tokens of an image simultaneously, and then refines the image iteratively conditioned on the previous generation. Our experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, and accelerates autoregressive decoding by up to 64x. Besides, we illustrate that MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is improving both the quality and the efficiency of image generation. Specifically:

1. **Limitations of existing methods**:
   - Current state-of-the-art generative transformer models (such as VQ-GAN) treat an image simply as a sequence of tokens and decode it in raster scan order (i.e., row by row). This approach is neither efficient nor optimal.
   - Autoregressive models are very slow at generating high-resolution images because they must generate the tokens of the image one at a time.
2. **Proposed new method**:
   - The paper introduces a bidirectional transformer decoder called MaskGIT (Masked Generative Image Transformer). During training, MaskGIT learns to predict randomly masked tokens by attending to the remaining tokens in all directions.
   - During inference, MaskGIT first generates all tokens of the image simultaneously and then refines the image iteratively, conditioning each pass on the previous generation. This greatly reduces the number of decoding steps while maintaining high image quality.
3. **Main contributions**:
   - **Quality improvement**: Experiments show that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
   - **Speed improvement**: MaskGIT accelerates autoregressive decoding by up to 64x.
   - **Flexibility**: MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.
4. **Specific application scenarios**:
   - **Class-conditional image generation**: MaskGIT can generate high-quality images conditioned on a class label.
   - **Image inpainting**: MaskGIT can fill in missing parts of an image with natural, consistent content.
   - **Image extrapolation**: MaskGIT can generate content beyond the original image boundaries, extending the image.
   - **Class-conditional image editing**: MaskGIT can regenerate a specified region according to a given class while keeping the background unchanged.

In summary, MaskGIT targets the quality and efficiency shortcomings of existing image generation methods, offering a more efficient and flexible approach to image generation and editing.
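
The inference procedure summarized above (start with every token masked, predict all tokens in parallel, then refine over a few passes) can be illustrated with a minimal sketch. Several parts are assumptions made for illustration and are not spelled out in the summary: the rule of committing the most confident predictions and re-masking the rest, the cosine schedule for how many tokens stay masked, the `MASK` sentinel, and the `toy_predict` function, which is a random stand-in for the bidirectional transformer rather than a real model.

```python
import math
import random

MASK = -1  # hypothetical sentinel id marking a still-masked position


def toy_predict(tokens, vocab_size, rng):
    """Random stand-in for the bidirectional transformer (hypothetical).

    Returns one (token_id, confidence) pair per position. A real model
    would condition its predictions on every unmasked entry of `tokens`.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(num_tokens, vocab_size, steps, seed=0):
    """Decode `num_tokens` tokens in at most `steps` parallel passes.

    Returns the final token list and, for each pass, how many positions
    were left masked afterwards.
    """
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens  # start from a fully masked sequence
    masked_counts = []

    for t in range(1, steps + 1):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break

        # One parallel prediction for every position.
        preds = toy_predict(tokens, vocab_size, rng)

        if t == steps:
            n_keep = 0  # the last pass commits everything that remains
        else:
            # Assumed cosine schedule: positions to leave masked after pass t.
            n_keep = math.floor(num_tokens * math.cos(math.pi / 2 * t / steps))
            # Always commit at least one token so each pass makes progress.
            n_keep = min(n_keep, len(masked) - 1)

        # Commit the most confident masked positions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - n_keep]:
            tokens[i] = preds[i][0]

        masked_counts.append(n_keep)

    return tokens, masked_counts


if __name__ == "__main__":
    final_tokens, counts = iterative_decode(num_tokens=16, vocab_size=1024, steps=8)
    print(counts)
    print(final_tokens)
```

With 16 tokens and 8 passes, this schedule leaves 15, 14, 13, 11, 8, 6, 3, and finally 0 positions masked, so only a few tokens are committed early and most are filled in during the later passes. A raster-order autoregressive decoder would instead need one pass per token, which is where the reduction in decoding steps comes from.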