Abstract: While vision transformers are able to solve a wide variety of computer vision tasks, no pre-training method has yet demonstrated the same scaling laws as observed in language models. Autoregressive models show promising results, but are commonly trained on images that are cropped or transformed into square images, which distorts or destroys information present in the input. To overcome this limitation, we propose NARAIM, a vision model pre-trained with an autoregressive objective that uses images in their native aspect ratio. By maintaining the native aspect ratio, we preserve the original spatial context, thereby enhancing the model's ability to interpret visual information. In our experiments, we show that maintaining the aspect ratio improves performance on a downstream classification task.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: in the field of computer vision, the existing pre-training methods have not yet demonstrated scaling laws similar to those of language models. In particular, although autoregressive models show promising results, they usually crop or convert images into squares during training, which distorts or destroys the information in the input images. Specifically, the paper points out:

1. **Existing problems**:
   - Although Vision Transformers can solve various computer vision tasks, no pre-training method has been found to show favorable scaling laws as language models do.
   - Autoregressive models usually crop or convert images into squares during training, which distorts or destroys the original spatial information in the input images.
2. **Solutions**:
   - NARAIM (Native Aspect Ratio Autoregressive Image Models), an autoregressive image model pre-trained using the native aspect ratio, is proposed.
   - By maintaining the native aspect ratio of the image, NARAIM preserves the original spatial context, thereby enhancing the model's ability to interpret visual information.
3. **Objectives**:
   - Prove that maintaining the native aspect ratio of the image can improve the performance of downstream classification tasks.
   - Study the impact of different position embeddings (such as absolute position embeddings and fractional position embeddings) and data augmentation methods (such as random cropping) on model performance.

### Summary

The main purpose of this paper is to solve the problem of spatial information loss caused by cropping or converting images into squares during the training process of current autoregressive image models by introducing the NARAIM model. By maintaining the native aspect ratio of the image, NARAIM aims to improve the performance of the model in downstream tasks, especially when dealing with non-square images.
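To make the contrast between square cropping and native-aspect-ratio patching concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names (`native_grid`, `fractional_positions`, `square_crop_kept`), the fixed patch budget, and the choice of patch centers as fractional coordinates are all assumptions made for illustration. It shows how a patch grid might follow the image's height-to-width ratio, how each patch could be given a position in [0, 1] regardless of grid size, and how much of an image a center square crop would discard.

```python
import math


def native_grid(height, width, max_patches):
    """Choose a (rows, cols) patch grid that roughly keeps height:width
    while using at most max_patches patches (hypothetical helper)."""
    aspect = height / width
    rows = max(1, math.floor(math.sqrt(max_patches * aspect)))
    cols = max(1, math.floor(math.sqrt(max_patches / aspect)))
    # Clamp for extreme aspect ratios so the budget is never exceeded.
    rows = min(rows, max_patches)
    cols = min(cols, max_patches // rows)
    return rows, cols


def fractional_positions(rows, cols):
    """Return one (y, x) pair per patch in row-major (raster) order.
    Coordinates are patch centers scaled to the unit interval, which is
    one illustrative way to define a fractional position."""
    return [((r + 0.5) / rows, (c + 0.5) / cols)
            for r in range(rows)
            for c in range(cols)]


def square_crop_kept(height, width):
    """Fraction of pixels that survive the largest centered square crop."""
    return min(height, width) / max(height, width)


if __name__ == "__main__":
    rows, cols = native_grid(480, 640, max_patches=256)
    print("native grid:", rows, "x", cols)
    print("first positions:", fractional_positions(rows, cols)[:3])
    print("kept by square crop:", square_crop_kept(480, 640))
```

For a 480x640 image and a budget of 256 patches, this sketch yields a 13x18 grid, whereas a center square crop of the same image would keep only three quarters of its pixels. Because the fractional coordinates depend on relative position rather than on absolute row and column indices, grids of different shapes share the same coordinate range.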

NARAIM: Native Aspect Ratio Autoregressive Image Models

Scalable Pre-training of Large Autoregressive Image Models

Exploring Stochastic Autoregressive Image Modeling for Visual Representation

Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis

Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Enhancing Neural Rendering Methods with Image Augmentations

Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification

Efficient Rotation Invariance in Deep Neural Networks through Artificial Mental Rotation

AIM: Adapting Image Models for Efficient Video Action Recognition

Image Reconstruction using Enhanced Vision Transformer

DepthART: Monocular Depth Estimation as Autoregressive Refinement Task

ViR: Towards Efficient Vision Retention Backbones

AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer

Robust Training Using Natural Transformation

LAR-IQA: A Lightweight, Accurate, and Robust No-Reference Image Quality Assessment Model

RATIR-Net: Adaptive SAR Image Reconstruction Based on Transformer Architecture