NARAIM: Native Aspect Ratio Autoregressive Image Models

Daniel Gallo Fernández, Robert van der Klis, Rǎzvan-Andrei Matişan, Janusz Partyka, Efstratios Gavves, Samuele Papa, Phillip Lippe
2024-10-14
Abstract: While vision transformers are able to solve a wide variety of computer vision tasks, no pre-training method has yet demonstrated the same scaling laws as observed in language models. Autoregressive models show promising results, but are commonly trained on images that are cropped or transformed into square images, which distorts or destroys information present in the input. To overcome this limitation, we propose NARAIM, a vision model pre-trained with an autoregressive objective that uses images in their native aspect ratio. By maintaining the native aspect ratio, we preserve the original spatial context, thereby enhancing the model's ability to interpret visual information. In our experiments, we show that maintaining the aspect ratio improves performance on a downstream classification task.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: in the field of computer vision, the existing pre-training methods have not yet demonstrated scaling laws similar to those of language models. In particular, although autoregressive models show promising results, they usually crop or convert images into squares during training, which distorts or destroys the information in the input images. Specifically, the paper points out:

1. **Existing problems**:
   - Although Vision Transformers can solve various computer vision tasks, no pre-training method has been found to show favorable scaling laws as language models do.
   - Autoregressive models usually crop or convert images into squares during training, which distorts or destroys the original spatial information in the input images.
2. **Solutions**:
   - NARAIM (Native Aspect Ratio Autoregressive Image Models), an autoregressive image model pre-trained using the native aspect ratio, is proposed.
   - By maintaining the native aspect ratio of the image, NARAIM preserves the original spatial context, thereby enhancing the model's ability to interpret visual information.
3. **Objectives**:
   - Prove that maintaining the native aspect ratio of the image can improve the performance of downstream classification tasks.
   - Study the impact of different position embeddings (such as absolute position embeddings and fractional position embeddings) and data augmentation methods (such as random cropping) on model performance.

### Summary

The main purpose of this paper is to solve the problem of spatial information loss caused by cropping or converting images into squares during the training process of current autoregressive image models by introducing the NARAIM model. By maintaining the native aspect ratio of the image, NARAIM aims to improve the performance of the model in downstream tasks, especially when dealing with non-square images.
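To make the idea of a native aspect ratio concrete, the following is a minimal sketch and not the authors' implementation. It assumes a fixed patch budget per image and chooses a patch grid whose shape follows the image's height-to-width ratio, instead of forcing a square grid. It also shows one plausible reading of fractional position embeddings, in which each patch is indexed by its relative position in the range 0 to 1 rather than by an absolute row and column. The function names, the patch size of 16, and the budget of 256 patches are all hypothetical choices made for illustration.

```python
import math


def native_patch_grid(height, width, patch_size=16, max_patches=256):
    """Return (rows, cols) of a patch grid that follows the image's aspect ratio.

    Hypothetical helper: the image is assumed to be rescaled uniformly so that
    at most `max_patches` patches of size `patch_size` cover it.
    """
    # Uniform scale factor that makes the rescaled area fit the patch budget.
    scale = math.sqrt(max_patches * patch_size ** 2 / (height * width))
    rows = max(1, math.floor(height * scale / patch_size))
    cols = max(1, math.floor(width * scale / patch_size))
    # Clamp for extremely elongated images, where one side rounds up to 1.
    rows = min(rows, max_patches)
    cols = min(cols, max_patches // rows)
    return rows, cols


def fractional_positions(rows, cols):
    """Return patch centers as (y, x) fractions of the image extent, row-major.

    Assumption: "fractional" is read here as coordinates normalized to [0, 1],
    so grids of different shapes share one coordinate range.
    """
    return [((r + 0.5) / rows, (c + 0.5) / cols)
            for r in range(rows)
            for c in range(cols)]


if __name__ == "__main__":
    # A 480x640 (3:4) image keeps a wider-than-tall grid under this scheme,
    # whereas a square resize would always produce a 16x16 grid.
    rows, cols = native_patch_grid(480, 640)
    print("native grid:", rows, "x", cols)
    print("first positions:", fractional_positions(rows, cols)[:3])
```

Under these assumptions, a landscape image produces more columns than rows, so horizontal and vertical content are sampled at the same scale, while the normalized coordinates stay comparable across images of different shapes.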