XtremeCLIP: Extremely Parameter-efficient Tuning for Low-resource Vision Language Understanding

Moming Tang, Chengyu Wang, Jianing Wang, Chuanqi Tan, Songfang Huang, Cen Chen, Weining Qian
DOI: https://doi.org/10.18653/v1/2023.findings-acl.397
2023-01-01
Abstract: Recently, Contrastive Visual-Language Pretraining (CLIP) has demonstrated remarkable capability in various Visual Language Understanding (VLU) tasks. Yet, most CLIP-based methods require task-specific designs and sufficient training data. In this paper, we introduce a simple yet efficient paradigm for low-resource VLU named XtremeCLIP, which involves very few trainable parameters to improve the generalization ability of the trained models. In our XtremeCLIP framework, we reformulate a series of VLU tasks as a unified open-book affinity-matching problem. Furthermore, to handle the insufficient supervised signals in small datasets, we adopt contrastive learning to utilize the implicit sorting information of ground-truth labels to provide more supervised cues. Extensive experiments over multiple datasets on visual entailment, visual question answering, and image classification show that XtremeCLIP consistently outperforms existing baselines in low-resource settings.