Abstract: The development of autoregressive modeling (AM) in computer vision lags behind natural language processing (NLP) in self-supervised pre-training. This is mainly because images are not sequential signals and lack a natural order in which to apply autoregressive modeling. In this study, inspired by the way humans grasp an image, i.e., focusing on the main object first, we present a semantic-aware autoregressive image modeling (SemAIM) method to tackle this challenge. The key insight of SemAIM is to model images autoregressively from the more semantic patches to the less semantic patches. To this end, we first calculate a semantic-aware permutation of patches according to their feature similarities and then perform the autoregression procedure based on the permutation. In addition, considering that the raw pixels of patches are low-level signals and are not ideal prediction targets for learning high-level semantic representations, we also explore utilizing the patch features as the prediction targets. Extensive experiments are conducted on a broad range of downstream tasks, including image classification, object detection, and instance/semantic segmentation, to evaluate the performance of SemAIM. The results demonstrate that SemAIM achieves state-of-the-art performance compared with other self-supervised methods. Specifically, with ViT-B, SemAIM achieves 84.1% top-1 accuracy for fine-tuning on ImageNet, and 51.3% AP and 45.4% AP for object detection and instance segmentation on COCO, outperforming the vanilla MAE by 0.5%, 1.0%, and 0.5%, respectively.
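
The two steps named in the abstract (a semantic-aware permutation of patches, then autoregression that follows it) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not say how the feature similarities are computed, so the sketch assumes cosine similarity between each patch feature and a single reference vector; the names `semantic_permutation`, `permuted_causal_mask`, and `ref_feat` are hypothetical. It also assumes each patch may attend to itself, as in a standard causal mask.

```python
import numpy as np


def semantic_permutation(patch_feats, ref_feat):
    """Return patch indices ordered from most to least similar to ref_feat.

    Assumption: "semantic" is approximated by cosine similarity to one
    reference vector; the abstract does not specify the similarity measure.
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    r = ref_feat / np.linalg.norm(ref_feat)
    sims = p @ r
    # Stable sort so that ties keep their original (raster) order.
    return np.argsort(-sims, kind="stable")


def permuted_causal_mask(perm):
    """Build a boolean attention mask that follows the permutation.

    mask[i, j] is True when patch j comes no later than patch i in the
    permutation, so patch i may attend to patch j (itself included).
    """
    n = len(perm)
    rank = np.empty(n, dtype=int)
    rank[perm] = np.arange(n)  # rank[k] = position of patch k in the order
    return rank[None, :] <= rank[:, None]


# Toy example: four patches with 2-D features (illustrative values only).
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
perm = semantic_permutation(feats, ref_feat=np.array([1.0, 0.0]))
mask = permuted_causal_mask(perm)
```

With such a mask, a Transformer would predict each patch only from patches that precede it in the semantic order rather than in raster order; the prediction target (raw pixels or patch features) is a separate choice.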

Semantic Image Synthesis with Semantically Coupled VQ-Model

Fine-grained Semantic Constraint in Image Synthesis

Semantic Image Synthesis via Adversarial Learning

Label-free Neural Semantic Image Synthesis

Real-Time Image Semantic Retrieval Based on VQ

Semantic Probability Distribution Modeling for Diverse Semantic Image Synthesis

Unlocking Pre-trained Image Backbones for Semantic Image Synthesis

Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning

UNet-like network fused swin transformer and CNN for semantic image synthesis

Diverse Semantic Image Synthesis via Probability Distribution Modeling

Improving Semantic Control in Discrete Latent Spaces with Transformer Quantized Variational Autoencoders

Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

Semi-parametric Image Synthesis

Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Generating Diverse High-Fidelity Images with VQ-VAE-2

Leveraging Visual Question Answering to Improve Text-to-Image Synthesis

Few-shot Semantic Image Synthesis with Class Affinity Transfer

Semantic Image Synthesis with Unconditional Generator

Disentangled Representation Learning for Controllable Image Synthesis: an Information-Theoretic Perspective

Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers

Semantic Image Synthesis Via Diffusion Models