Turning a CLIP modal into image-text matching

Yafei Bu, Jintao Wang, Tao Yao, Ze Li, Shouyong Peng
DOI: https://doi.org/10.1117/12.2684681
2023-07-21
Abstract: Image-text matching (ITM) has benefited from the Large-scale Contrastive Language-Image Pre-training (CLIP) method, which achieves higher accuracy. However, CLIP learns by contrasting global visual and textual features, so it lacks fine-grained inter-modal information, which inevitably leads to mismatches during image-text matching. In this work, we therefore propose Turning a CLIP Model into Image-Text Matching (CIT), a method that combines fine-grained information between modalities to convert a CLIP model into a more efficient ITM model. CIT effectively improves the image-text matching accuracy of existing CLIP models and requires no additional pre-training. We demonstrate the effectiveness of our method through experiments against a range of state-of-the-art methods on two widely used datasets.
Engineering, Computer Science
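
The contrast the abstract draws between global and fine-grained matching can be illustrated with a toy sketch. This is not the authors' CIT method, whose details the abstract does not give. It assumes mean pooling as a stand-in for CLIP's global embedding and uses a generic token-to-patch maximum-similarity score as one possible fine-grained measure. All function names and the 2-D toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pool(vectors):
    """Average a list of vectors into one vector (assumed global feature)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def global_score(patches, tokens):
    """Global matching: compare one pooled image vector with one pooled text vector."""
    return cosine(mean_pool(patches), mean_pool(tokens))


def fine_grained_score(patches, tokens):
    """Fine-grained matching (illustrative): for each text token take its
    best-matching image patch, then average over tokens."""
    best = [max(cosine(t, p) for p in patches) for t in tokens]
    return sum(best) / len(best)


# Toy data: two image patches, each showing a distinct concept.
patches = [[1.0, 0.0], [0.0, 1.0]]
# Caption A names both concepts separately; caption B blurs them together.
caption_a = [[1.0, 0.0], [0.0, 1.0]]
caption_b = [[1.0, 1.0], [1.0, 1.0]]

for name, caption in (("A", caption_a), ("B", caption_b)):
    print(name, global_score(patches, caption), fine_grained_score(patches, caption))
```

In this toy case the pooled vectors of both captions point in the same direction, so the global score cannot separate them, while the token-level score favors the caption whose words each align with a specific patch.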