LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

Ying Nie, Wei He, Kai Han, Yehui Tang, Tianyu Guo, Fanyi Du, Yunhe Wang
2023-12-01
Abstract: Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most of the existing CLIP-alike works usually adopt relatively large image encoders like ResNet50 and ViT, while the lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. Firstly, to mitigate the problem that some image-text pairs are not strictly one-to-one correspondence, we improve the conventional global instance-level alignment objective by softening the label of negative samples progressively. Secondly, a relaxed bipartite matching based token-level alignment objective is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of CLIP model does not increase correspondingly as the parameters of text encoder increase, an extra objective of masked language modeling (MLM) is leveraged for maximizing the potential of the shortened text encoder. In practice, an auxiliary fusion module injecting unmasked image embedding into masked text embedding at different network stages is proposed for enhancing the MLM. Extensive experiments show that without introducing additional computational cost during inference, the proposed method achieves a higher performance on multiple downstream tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the deployment of lightweight vision-language models (VLMs), in particular CLIP-like models, on resource-constrained devices. Existing CLIP-like models perform well on downstream tasks such as zero-shot image classification and image-text retrieval, but they usually adopt large image encoders (such as ResNet50 and ViT), which makes them difficult to deploy on edge devices. The paper therefore proposes a multi-level interaction paradigm for training lightweight CLIP models, aiming to improve performance on multiple downstream tasks without increasing the computational cost of inference. Specifically, it tackles three problems:
1. **Image-text pairs without strict one-to-one correspondence**: Some image-text pairs crawled from the web do not correspond strictly one-to-one, which is particularly harmful when training lightweight CLIP models. To mitigate this, the paper improves the conventional global instance-level alignment objective by progressively softening the labels of negative samples.
2. **Fine-grained alignment**: The global instance-level alignment objective alone is not sufficient to align image patches with text words. The paper introduces a token-level alignment objective based on relaxed bipartite matching to achieve finer-grained alignment.
3. **Larger text encoders do not bring corresponding accuracy gains**: The authors observe that the accuracy of CLIP models does not increase correspondingly as the number of text encoder parameters grows. The paper therefore reduces the number of text encoder layers and adds a masked language modeling (MLM) objective to maximize the potential of the shortened text encoder.
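The progressive label softening in item 1 can be illustrated with a minimal Python sketch. The summary does not give the paper's exact formula, so the sketch assumes plain uniform label smoothing over the negatives, with a coefficient `alpha` that ramps up linearly during training. The names `softened_targets`, `soft_contrastive_loss`, and `alpha_schedule`, as well as the `alpha_max=0.2` default, are hypothetical and not taken from the paper.

```python
import math


def softened_targets(batch_size, index, alpha):
    """Target distribution for one sample: 1 - alpha on its own pair,
    alpha shared equally among the other (negative) pairs.
    Assumption: uniform sharing; the paper may weight negatives differently."""
    if batch_size < 2:
        raise ValueError("need at least one negative sample")
    negative_mass = alpha / (batch_size - 1)
    return [1.0 - alpha if j == index else negative_mass
            for j in range(batch_size)]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def soft_contrastive_loss(logits, alpha):
    """Image-to-text cross-entropy against softened targets.
    logits[i][j] is the similarity of image i and text j; the matching
    pair sits on the diagonal. With alpha = 0 this reduces to the usual
    one-hot contrastive loss in one direction."""
    n = len(logits)
    total = 0.0
    for i, row in enumerate(logits):
        log_probs = log_softmax(row)
        targets = softened_targets(n, i, alpha)
        total -= sum(t * lp for t, lp in zip(targets, log_probs))
    return total / n


def alpha_schedule(step, total_steps, alpha_max=0.2):
    """Hypothetical linear ramp of the softening coefficient from 0 to alpha_max."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return alpha_max * progress
```

In a training loop one would call `soft_contrastive_loss(logits, alpha_schedule(step, total_steps))`, so that early steps use hard one-hot labels and later steps tolerate negatives that partially match.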
With these techniques, the proposed method achieves higher performance on multiple benchmarks while preserving inference efficiency.
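The token-level alignment objective described above relies on matching text words to image patches. The summary does not say how the paper relaxes the matching, so the following Python sketch shows only the strict one-to-one case as a stand-in: each word is assigned to a distinct patch so that the total similarity is maximal. It uses brute-force enumeration, which is only practical for tiny inputs; a real implementation would use a Hungarian-style solver. The function names `dot` and `best_one_to_one_matching` are hypothetical.

```python
from itertools import permutations


def dot(u, v):
    """Dot product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_one_to_one_matching(word_embs, patch_embs):
    """Assign every word to a distinct patch, maximizing total similarity.

    Returns (mean similarity of matched pairs, tuple where entry i is the
    patch index assigned to word i). Brute force over all assignments,
    for illustration only.
    """
    n_words, n_patches = len(word_embs), len(patch_embs)
    if n_words == 0:
        raise ValueError("need at least one word embedding")
    if n_words > n_patches:
        raise ValueError("sketch assumes at least as many patches as words")

    # Word-by-patch similarity matrix.
    sim = [[dot(w, p) for p in patch_embs] for w in word_embs]

    best_score, best_assignment = float("-inf"), None
    for assignment in permutations(range(n_patches), n_words):
        score = sum(sim[i][j] for i, j in enumerate(assignment))
        if score > best_score:
            best_score, best_assignment = score, assignment
    return best_score / n_words, best_assignment
```

The mean matched similarity could then serve as a token-level score for an image-text pair, in place of the single global embedding similarity used by the instance-level objective.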