A Framework for Image Text Retrieval Based on Large Language Model

Jianwen Liao, Yingyun Yang, Yangchen Wang
DOI: https://doi.org/10.1109/CoST64302.2024.00020
2024-08-25
Abstract: Existing studies have shown that learning rich and precise semantic concepts improves the performance of image-text cross-modal retrieval. However, existing models are trained on coarse-grained sentences, which suffer from semantic confusion and missing semantics and therefore hinder accurate image-text retrieval. To address this, this paper proposes a framework for image-text retrieval based on large language models. The framework adopts a pattern of large and small model synergy: a large language model, combined with the image information output by a pre-trained model, transforms coarse-grained descriptive text into fine-grained descriptive text. In addition, a text counterfactual negative sample generation module is added to eliminate the spurious correlations that exist between semantic concepts. Extensive experiments on the Flickr30K and MSCOCO datasets demonstrate the effectiveness of the proposed method.
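The idea of a text counterfactual negative sample can be illustrated with a minimal sketch. The abstract does not describe how the module works, so everything below is an assumption for illustration only: the swap table `COUNTERFACTUAL_SWAPS`, the function names, the word-level substitution strategy, and the margin-based hinge loss are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical swap table (an assumption, not from the paper): each concept
# word maps to a plausible but incorrect substitute.
COUNTERFACTUAL_SWAPS = {
    "dog": "cat",
    "red": "blue",
    "running": "sitting",
}


def make_counterfactual(caption: str) -> str:
    """Replace the first swappable concept word to form a negative caption."""
    words = caption.split()
    for i, word in enumerate(words):
        if word in COUNTERFACTUAL_SWAPS:
            words[i] = COUNTERFACTUAL_SWAPS[word]
            return " ".join(words)
    return caption  # no known concept found; caption is returned unchanged


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def triplet_hinge_loss(image_vec, pos_vec, neg_vec, margin=0.2):
    """Hinge loss that is zero once the true caption beats the negative by `margin`."""
    return max(0.0, margin - cosine(image_vec, pos_vec) + cosine(image_vec, neg_vec))
```

In this sketch, the counterfactual caption differs from the original in a single concept, so a model trained with such negatives is pushed to attend to that concept rather than to co-occurring words. The embedding vectors would come from image and text encoders, which are outside the scope of this example.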
Computer Science