Abstract: Recent breakthroughs in vision-language models (VLMs) have opened a new chapter for the vision community. Trained on large-scale Internet image-text pairs, VLMs provide stronger and more generalizable feature embeddings than ImageNet-pretrained models. However, despite these achievements, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although the pure transformer has proven effective for text encoding, it remains questionable whether the same holds for image encoding, especially since the many network types proposed on the ImageNet benchmark have rarely been studied in VLMs. Because of the small data and model scale, the original conclusions about model design drawn on ImageNet can be limited and biased. In this paper, we aim to build an evaluation protocol for vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. We provide a comprehensive way to benchmark different vision models, covering their zero-shot performance and their scalability in both model and training data sizes. To this end, we introduce ViTamin, a new family of vision models tailored for VLMs. ViTamin-L significantly outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse benchmarks, including classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. When further scaling up the model size, our ViTamin-XL, with only 436M parameters, attains 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% achieved by EVA-E, which has ten times more parameters (4.4B).

What matters when building vision-language models?

Building and better understanding vision-language models: insights and future directions

Rethinking Overlooked Aspects in Vision-Language Models

An Introduction to Vision-Language Modeling

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

ViTamin: Designing Scalable Vision Models in the Vision-Language Era

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases

On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning

Towards Interpreting Visual Information Processing in Vision-Language Models

From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design

Towards Better Vision-Inspired Vision-Language Models

Vision-Language Models for Vision Tasks: A Survey

Rethinking VLMs and LLMs for Image Classification

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

POINTS: Improving Your Vision-language Model with Affordable Strategies

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks