Abstract: We introduce XTRA, a vision model pre-trained with a novel auto-regressive objective that significantly enhances both sample and parameter efficiency compared to previous auto-regressive image models. Unlike contrastive or masked image modeling methods, which have not been demonstrated as having consistent scaling behavior on unbalanced internet data, auto-regressive vision models exhibit scalable and promising performance as model and dataset size increase. In contrast to standard auto-regressive models, XTRA employs a Block Causal Mask, where each Block represents k $\times$ k tokens rather than relying on a standard causal mask. By reconstructing pixel values block by block, XTRA captures higher-level structural patterns over larger image regions. Predicting on blocks allows the model to learn relationships across broader areas of pixels, enabling more abstract and semantically meaningful representations than traditional next-token prediction. This simple modification yields two key results. First, XTRA is sample-efficient. Despite being trained on 152$\times$ fewer samples (13.1M vs. 2B), XTRA ViT-H/14 surpasses the top-1 average accuracy of the previous state-of-the-art auto-regressive model across 15 diverse image recognition benchmarks. Second, XTRA is parameter-efficient. Compared to auto-regressive models trained on ImageNet-1k, XTRA ViT-B/16 outperforms in linear and attentive probing tasks, using 7-16$\times$ fewer parameters (85M vs. 1.36B/0.63B).
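
The Block Causal Mask described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `block_causal_mask`, the raster ordering of blocks, and the choice to let tokens inside one block attend to each other are all assumptions made for illustration. The abstract only states that each block covers k $\times$ k tokens.

```python
import numpy as np


def block_causal_mask(grid_h: int, grid_w: int, k: int) -> np.ndarray:
    """Return a boolean (N, N) mask with N = grid_h * grid_w tokens.

    mask[i, j] is True when token i may attend to token j.
    Assumptions (not taken from the paper): tokens are flattened in raster
    order, blocks are k x k tiles visited in raster order, and attention
    inside a block is unrestricted.
    """
    if grid_h % k or grid_w % k:
        raise ValueError("grid dimensions must be divisible by k")

    # Row and column of every token on the patch grid.
    rows, cols = np.divmod(np.arange(grid_h * grid_w), grid_w)

    # Index of the k x k block each token falls into.
    blocks_per_row = grid_w // k
    block_id = (rows // k) * blocks_per_row + (cols // k)

    # A token sees its own block and every earlier block.
    return block_id[:, None] >= block_id[None, :]


mask = block_causal_mask(grid_h=4, grid_w=4, k=2)
print(mask.astype(int))
```

Under these assumptions, setting `k=1` makes every block a single token, so the mask reduces to the standard lower-triangular causal mask.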

What problem does this paper attempt to address?

The problem this paper addresses is how to improve the sample efficiency and parameter efficiency of autoregressive image models. Specifically, it introduces a vision model named XTRA, pre-trained with a novel autoregressive objective, that significantly improves on both measures relative to previous autoregressive image models.

### Sample Efficiency

Although XTRA is trained on 152 times fewer samples (13.1M vs. 2B) than the previous state-of-the-art autoregressive image model, its average top-1 accuracy across 15 image recognition benchmarks still exceeds that model's. For example, XTRA ViT-H/14 performs better than AIM-0.6B on these benchmarks.

### Parameter Efficiency

XTRA is also efficient in parameter count. Compared with autoregressive image models trained on ImageNet-1k, XTRA ViT-B/16 achieves better performance in linear and attentive probing tasks while using 7 to 16 times fewer parameters (85M vs. 1.36B/0.63B).

### Main Contributions

1. **High Sample Efficiency**: Although it is trained on 152 times fewer samples, XTRA ViT-H/14 exceeds the average accuracy of the previous state-of-the-art autoregressive model of the same size across 15 image recognition benchmarks.
2. **High Parameter Efficiency**: XTRA ViT-B/16 outperforms autoregressive image models trained on ImageNet-1k in linear and attentive probing tasks while using 7 to 16 times fewer parameters.

### Method

XTRA achieves these improvements by introducing the Block Causal Mask. Instead of relying on the standard causal mask, the Block Causal Mask divides an image into blocks, each covering k $\times$ k tokens, and the model reconstructs pixel values block by block. This design lets the model use its capacity more effectively and capture higher-level structural patterns over larger image regions, rather than focusing only on high-frequency details. Through block-level prediction, XTRA can learn more abstract and semantically meaningful representations than traditional next-token prediction.

### Experimental Results

The experimental results show that XTRA performs well across multiple image recognition tasks, particularly in terms of sample and parameter efficiency. These results highlight the practicality of XTRA in resource-constrained environments.

In conclusion, by proposing XTRA, the paper addresses the shortcomings of autoregressive image models in sample efficiency and parameter efficiency, and offers a more efficient and practical approach to visual tasks.
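
As a quick sanity check, the efficiency ratios quoted above can be recomputed from the rounded figures in the text. The variable names are illustrative only, and because the inputs are rounded, the results are approximate.

```python
# Figures as quoted in the text (rounded values).
xtra_samples = 13.1e6   # 13.1M training samples for XTRA
prev_samples = 2e9      # 2B samples for the previous state-of-the-art model

xtra_params = 85e6      # XTRA ViT-B/16
baseline_params = {"1.36B": 1.36e9, "0.63B": 0.63e9}  # ImageNet-1k baselines

# How many times fewer samples XTRA uses (about 152.7, quoted as 152x).
sample_ratio = prev_samples / xtra_samples

# How many times fewer parameters XTRA uses against each baseline.
param_ratios = {name: p / xtra_params for name, p in baseline_params.items()}

print(f"sample ratio: {sample_ratio:.1f}x")
for name, ratio in param_ratios.items():
    print(f"parameter ratio vs. {name}: {ratio:.1f}x")
```

The two parameter ratios come out at roughly 16 and 7.4, which matches the quoted range of 7 to 16 times fewer parameters.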

Sample- and Parameter-Efficient Auto-Regressive Image Models
