VioLET: Vision-Language Efficient Tuning with Collaborative Multi-modal Gradients

Yaoming Wang, Yuchen Liu, Xiaopeng Zhang, Jin Li, Bowen Shi, Chenglin Li, Wenrui Dai, Hongkai Xiong, Qi Tian
DOI: https://doi.org/10.1145/3581783.3611706
2023-01-01
Abstract: Parameter-Efficient Tuning (PET) has emerged as a leading advancement in both Natural Language Processing and Computer Vision, enabling efficient accommodation of downstream tasks without costly fine-tuning. However, most existing PET approaches are limited to uni-modal tuning, even for vision-language models like CLIP. We investigate this limitation and demonstrate that simultaneous tuning of the two modalities in such models leads to multi-modal forgetting and catastrophic performance degradation, particularly when generalizing to new classes. To address this issue, we propose a novel PET approach called VioLET (Vision-Language Efficient Tuning) that utilizes collaborative multi-modal gradients to unlock the full potential of both modalities. Specifically, we incorporate an additional visual encoder without learnable parameters and use these two visual encoders to compute the gradients of the context parameters separately. When conflicts arise, we replace the original gradient with an orthogonal gradient. Extensive experiments are conducted on few-shot recognition and unseen class generalization tasks using ResNet-50 or ViT-B/16 as the backbone. VioLET consistently outperforms several state-of-the-art methods on 11 datasets, showcasing its superiority over existing PET approaches. The code is available at https://github.com/Wang-Yaoming/VioLET.
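The gradient replacement step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact rule, so the sketch makes two assumptions: a "conflict" means a negative inner product between the two gradients, and the "orthogonal gradient" is the component of the tuned-encoder gradient orthogonal to the frozen-encoder gradient (a common projection-style choice). The function names `dot` and `collaborative_gradient` are hypothetical and are not taken from the VioLET code base.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two flattened gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def collaborative_gradient(
    g_tuned: Sequence[float],
    g_frozen: Sequence[float],
    eps: float = 1e-12,
) -> List[float]:
    """Return the gradient used to update the context parameters.

    g_tuned:  gradient computed through the visual encoder being tuned.
    g_frozen: gradient computed through the additional visual encoder
              that has no learnable parameters.

    Assumption (not stated in the abstract): the gradients conflict
    when their inner product is negative. In that case g_tuned is
    replaced by its component orthogonal to g_frozen.
    """
    inner = dot(g_tuned, g_frozen)
    if inner >= 0:
        # No conflict: keep the original gradient unchanged.
        return list(g_tuned)
    # Conflict: subtract the projection of g_tuned onto g_frozen.
    scale = inner / (dot(g_frozen, g_frozen) + eps)
    return [gt - scale * gf for gt, gf in zip(g_tuned, g_frozen)]


# Example: the two gradients disagree along the second axis.
g = collaborative_gradient([1.0, -1.0], [0.0, 1.0])
# g is (numerically) orthogonal to the frozen-encoder gradient.
```

Under these assumptions, the update never moves the context parameters against the direction given by the frozen encoder, which is one plausible way to limit the multi-modal forgetting the abstract describes.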