Abstract: Generative transformers have experienced rapid popularity growth in the computer vision community in synthesizing high-fidelity and high-resolution images. The best generative transformer models so far, however, still treat an image naively as a sequence of tokens, and decode an image sequentially following the raster scan ordering (i.e. line-by-line). We find this strategy neither optimal nor efficient. This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder, which we term MaskGIT. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference time, the model begins with generating all tokens of an image simultaneously, and then refines the image iteratively conditioned on the previous generation. Our experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, and accelerates autoregressive decoding by up to 64x. Besides, we illustrate that MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.
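The training objective described in the abstract (predicting randomly masked tokens) can be illustrated with a minimal sketch. It only shows how a masked input could be built from a token sequence. The `MASK_ID` sentinel, the `mask_tokens` helper, and the fixed `mask_ratio` argument are hypothetical names chosen for illustration and are not taken from the paper; the transformer itself and the tokenizer are omitted.

```python
import random

MASK_ID = -1  # hypothetical sentinel standing in for a special [MASK] token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK_ID.

    Returns the masked sequence and the sorted positions that were masked.
    In masked-token training, the model would be asked to predict the
    original tokens at exactly these positions, using the visible tokens
    on every side as context.
    """
    n = len(tokens)
    num_masked = max(1, round(mask_ratio * n))  # always mask at least one token
    positions = set(rng.sample(range(n), num_masked))
    masked = [MASK_ID if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)


# A toy 4x4 token grid flattened in raster order (token ids are made up).
tokens = list(range(100, 116))
masked, positions = mask_tokens(tokens, mask_ratio=0.5, rng=random.Random(0))
```

Because the masked positions are scattered over the grid rather than forming a suffix of the sequence, the context available for each prediction comes from all directions, in contrast to raster-scan decoding.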

What problem does this paper attempt to address?

The main problem this paper attempts to address is improving the quality and efficiency of image generation. Specifically:

1. **Limitations of existing methods**:
   - Current state-of-the-art generative transformer models (such as VQ-GAN) treat an image simply as a sequence of tokens and decode it in raster scan order (i.e., row by row). This approach is neither efficient nor optimal.
   - Autoregressive models are very slow at generating high-resolution images because they must generate the tokens of an image one at a time.
2. **Proposed new method**:
   - The paper introduces MaskGIT (Masked Generative Image Transformer), a new image synthesis approach built on a bidirectional transformer decoder. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
   - During inference, MaskGIT generates all tokens of the image simultaneously and then iteratively refines the image, conditioned on the previous generation. This significantly improves generation speed while maintaining high image quality.
3. **Main contributions**:
   - **Quality improvement**: Experiments show that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
   - **Speed improvement**: MaskGIT accelerates autoregressive decoding by up to 64x.
   - **Flexibility**: MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.
4. **Specific application scenarios**:
   - **Class-conditional image generation**: MaskGIT can generate high-quality images given a class label.
   - **Image inpainting**: MaskGIT can fill in missing parts of an image so that the result looks natural and consistent.
   - **Image extrapolation**: MaskGIT can generate content beyond the original image boundaries, expanding the image.
   - **Class-conditional image editing**: MaskGIT can regenerate the content of a specified region according to a given class while keeping the background unchanged.

In summary, by proposing MaskGIT, this paper aims to address the shortcomings of existing image generation methods in terms of quality and efficiency, providing a more efficient and flexible method for image generation and editing.
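The inference procedure summarized above (predict all tokens at once, then refine over a few iterations) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `toy_predict` is a hypothetical random stand-in for the bidirectional transformer, `MASK` is a made-up sentinel, and the cosine schedule that decides how many tokens stay masked after each step is an assumption of this sketch.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a position that is still masked


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for the bidirectional transformer.

    Returns one (token_id, confidence) pair per position. A real model
    would condition on the unmasked tokens; this one is purely random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(num_tokens, vocab_size, num_steps, seed=0):
    """Fill a fully masked sequence in num_steps parallel refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, num_steps + 1):
        # Predict every position in parallel.
        preds = toy_predict(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many tokens remain masked after this step.
        keep_masked = math.floor(
            num_tokens * math.cos(math.pi / 2 * step / num_steps)
        )
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident predictions; the rest stay masked
        # and are predicted again in the next step.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        num_to_fill = max(0, len(masked) - keep_masked)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens


# 256 tokens (e.g. a 16x16 grid) decoded in 8 model calls instead of 256.
decoded = iterative_decode(num_tokens=256, vocab_size=1024, num_steps=8)
```

The speed-up over raster-scan decoding comes from the number of model calls: an autoregressive decoder needs one call per token, while this loop needs only `num_steps` calls regardless of sequence length.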

MaskGIT: Masked Generative Image Transformer

A Pytorch Reproduction of Masked Generative Image Transformer

GIT: A Generative Image-to-text Transformer for Vision and Language

MaskBit: Embedding-free Image Generation via Bit Tokens

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers

Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer

GIVT: Generative Infinite-Vocabulary Transformers

Fast Training of Diffusion Models with Masked Transformers

MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation

An Image is Worth 32 Tokens for Reconstruction and Generation

HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

M2T: Masking Transformers Twice for Faster Decoding

MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer

[MASK] is All You Need

Masked and Adaptive Transformer for Exemplar Based Image Translation

Generative adversarial transformers

Masked Auto-Encoders Meet Generative Adversarial Networks and Beyond

MAGVLT: Masked Generative Vision-and-Language Transformer

MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation

What to Hide from Your Students: Attention-Guided Masked Image Modeling

MAT: Mask-Aware Transformer for Large Hole Image Inpainting