Abstract: Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient because it avoids the extra [MASK] tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks. Code and models are available at https://github.com/jihaonew/UTA.
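
The alignment step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not state the exact objective, so a cosine-distance loss is assumed here, and the helper names (`sample_unmasked_indices`, `alignment_loss`), the 50% mask ratio, and the toy random "tokens" are all hypothetical. The point it shows is that the student only receives the unmasked positions and is compared against the frozen teacher's tokens at those same positions, with no [MASK] tokens involved.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_unmasked_indices(num_tokens, mask_ratio, rng):
    """Pick the token positions that stay unmasked (hypothetical helper)."""
    num_keep = max(1, round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_keep))


def alignment_loss(student_tokens, teacher_tokens, keep):
    """Mean (1 - cosine similarity) over the unmasked positions.

    student_tokens: one vector per kept position (the student never sees
                    masked positions, and no [MASK] token is fed to it).
    teacher_tokens: frozen-teacher vectors for every position of the full image.
    keep:           indices of the unmasked positions.
    """
    total = 0.0
    for student_vec, idx in zip(student_tokens, keep):
        total += 1.0 - cosine_similarity(student_vec, teacher_tokens[idx])
    return total / len(keep)


# Toy data standing in for the frozen teacher's per-token outputs.
rng = random.Random(0)
num_tokens, dim = 8, 4
teacher = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]

keep = sample_unmasked_indices(num_tokens, mask_ratio=0.5, rng=rng)

# A student that reproduces the teacher exactly has a loss of (numerically) zero.
perfect_student = [teacher[i] for i in keep]
print(len(keep), alignment_loss(perfect_student, teacher, keep))
```

In a real training loop the student vectors would come from a ViT applied to the kept patches and the loss would be minimized by gradient descent; the sketch only fixes the bookkeeping of which tokens are compared.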

What problem does this paper attempt to address?

The main problem this paper addresses is how to use an existing pre-trained CLIP model to further improve the training efficiency and quality of vision-language representations without introducing additional [MASK] tokens. Specifically, the authors propose Unmasked Token Alignment (UTA), which aims to remove the inconsistency between training and fine-tuning found in existing Masked Image Modeling (MIM) methods and to support zero-shot evaluation directly.

### Analysis of Main Problems

1. **High Computational Resource Requirements**: Although contrastive learning methods such as CLIP perform very well, training them from scratch on large-scale noisy datasets is computationally expensive, which puts it out of reach for most researchers.
2. **Limitations of Unimodal Representation Learning**: Existing MIM methods offer efficient self-supervised learning, but they learn single-modal representations and lack support for multimodal tasks.
3. **Inconsistency between Training and Fine-Tuning**: MIM methods feed [MASK] tokens to the model during pre-training in order to predict the masked content. These tokens are absent during fine-tuning, which creates an inconsistency between the two stages and can hurt generalization.

### Core Contributions of the UTA Method

- **Unmasked Token Alignment**: UTA aligns the unmasked visual tokens with the corresponding image tokens output by the frozen CLIP vision encoder. Because it introduces no additional [MASK] tokens, it reduces computational overhead and improves training efficiency.
- **Direct Zero-Shot Evaluation**: The pre-trained ViT model can be used directly for zero-shot classification and retrieval without contrastive fine-tuning on image-text pairs.
- **Maintaining Training-Fine-Tuning Consistency**: UTA inputs and aligns only unmasked tokens, so the model sees the same kind of input during training and inference, which improves its stability and performance.

### Experimental Results

The paper reports strong performance for UTA on multiple benchmarks, including zero-shot classification, zero-shot retrieval, and multimodal tasks (such as LLaVA-Bench). In particular, zero-shot classification accuracy on ImageNet reaches 78.5%, and after fine-tuning it reaches 80.8%, outperforming the other methods compared.

### Conclusion

By proposing Unmasked Token Alignment (UTA), the paper addresses three problems in existing methods: high computational resource requirements, the limitations of unimodal representation learning, and the inconsistency between training and fine-tuning. The result is a more efficient and better-performing approach to vision-language representation learning.
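
The direct zero-shot evaluation mentioned above relies on the student's image embeddings living in the same space as the CLIP text encoder's embeddings. A minimal sketch of that classification step follows, assuming cosine similarity as the score. The function name `zero_shot_classify` and the three-dimensional toy embeddings are hypothetical; real embeddings would come from the ViT and from the CLIP text encoder applied to class prompts.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the best-matching class name and all cosine-similarity scores.

    image_embedding:       vector produced by the vision encoder.
    class_text_embeddings: dict mapping class name -> text-encoder vector.
    """
    img = l2_normalize(image_embedding)
    scores = {}
    for name, text_emb in class_text_embeddings.items():
        txt = l2_normalize(text_emb)
        scores[name] = sum(a * b for a, b in zip(img, txt))
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (assumed values, for illustration only).
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
image_embedding = [0.2, 0.9, 0.1]

label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(label)  # dog
```

No image-text training appears anywhere in this step: classification only needs the two encoders to share an embedding space, which is what token-level alignment to the frozen CLIP vision encoder is meant to provide.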

Enhancing Vision-Language Model with Unmasked Token Alignment
