Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models

Kengo Nakata, Daisuke Miyashita, Youyang Ng, Yasuto Hoshi, Jun Deguchi
2024-08-29
Abstract: In this paper, we rethink sparse lexical representations for image retrieval. By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data, enabling us to utilize efficient sparse retrieval algorithms employed in natural language processing for image retrieval tasks. To assist the LLM in extracting image features, we apply data augmentation techniques for key expansion and analyze the impact with a metric for relevance between images and textual data. We empirically show the superior precision and recall performance of our image retrieval method compared to conventional vision-language model-based methods on the MS-COCO, PASCAL VOC, and NUS-WIDE datasets in a keyword-based image retrieval scenario, where keywords serve as search queries. We also demonstrate that the retrieval performance can be improved by iteratively incorporating keywords into search queries.
Computer Vision and Pattern Recognition, Information Retrieval
What problem does this paper attempt to address?
This paper revisits image retrieval based on sparse lexical representations in light of the rise of multi-modal large language models (M-LLMs). Specifically, the authors use M-LLMs that support visual prompting to extract image features and convert them into text, so that the efficient sparse retrieval algorithms developed for natural language processing can be applied to image retrieval.

### Problem Background

Conventional image retrieval methods usually rely on dense vector representations generated by deep neural networks (DNNs). Although these methods achieve strong retrieval accuracy, they fall short on interpretability and flexibility. Methods based on sparse lexical representations are more interpretable, but their performance may not match that of dense vector methods. In addition, when a query is incomplete, existing vision-language models may fill in the gaps from knowledge acquired during training, and this compensation does not necessarily match what the user actually wants.

### Paper Goals

1. **Introduce a new image retrieval system**: The system combines M-LLMs with retrieval algorithms based on sparse lexical representations, and its effectiveness is evaluated on several benchmark datasets.
2. **Enhance the ability of M-LLMs to extract image features**: Apply data augmentation techniques (such as image cropping) for key expansion, and quantitatively evaluate the resulting improvement in retrieval performance.
3. **Explore keyword-driven image retrieval**: Unlike conventional caption-based image retrieval, this work focuses on keyword-driven retrieval, in which keywords serve as the search query and which is more common in practical applications.
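The key expansion by cropping mentioned in goal 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the summary says that images are cropped but does not give the crop layout, so the 2x2 grid, the function names `grid_crop_boxes` and `expand_keys`, and the `describe` callable (a stand-in for a call to an M-LLM) are all hypothetical.

```python
from typing import Callable, List, Tuple

# (left, upper, right, lower) in pixels, the same box convention
# that Pillow's Image.crop accepts.
Box = Tuple[int, int, int, int]


def grid_crop_boxes(width: int, height: int, rows: int = 2, cols: int = 2) -> List[Box]:
    """Return the full-image box followed by rows * cols non-overlapping tiles.

    The grid layout is an assumption made for this sketch.
    """
    boxes: List[Box] = [(0, 0, width, height)]
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            upper = r * height // rows
            right = (c + 1) * width // cols
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


def expand_keys(width: int, height: int, describe: Callable[[Box], str]) -> str:
    """Concatenate the text produced for the full image and for every crop.

    `describe` is a hypothetical placeholder for prompting an M-LLM with
    the cropped region; here it only receives the crop box.
    """
    return " ".join(describe(box) for box in grid_crop_boxes(width, height))
```

In a real pipeline, `describe` would crop the image to the given box and ask the model for a description; the concatenated text then becomes the searchable key for that image, so small objects that the model misses at full resolution have a chance to appear in the text of a tile.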
### Main Contributions

- Proposes a text-to-image retrieval system based on M-LLMs and sparse lexical representations, and verifies its effectiveness on multiple benchmark datasets.
- Improves M-LLM image feature extraction through data augmentation techniques (such as image cropping), and analyzes the effect with a metric for relevance between images and text (such as CLIPScore).
- Shows empirically that retrieval performance improves when keywords are iteratively incorporated into the search query.

### Solution Overview

1. **Feature Extraction**: Use M-LLMs to describe images and generate text data such as labels and captions.
2. **Encode into Sparse Vectors**: Encode the generated text data into sparse vectors.
3. **Image Retrieval**: Encode the query keywords as a sparse vector and use an algorithm such as BM25 to retrieve relevant images from the database.

In this way, the authors aim to improve the interpretability and flexibility of the system while keeping retrieval efficient, so as to better meet the needs of users.
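The retrieval step in the overview above can be illustrated with a small, self-contained BM25 sketch. It is not the authors' implementation: the class name `BM25Index`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75` (common defaults), and the three captions are all assumptions made for illustration. The captions stand in for text that an M-LLM would have generated for each image.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Simplistic lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


class BM25Index:
    """Minimal BM25 index over per-image text; a sketch, not a tuned system."""

    def __init__(self, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        # Sparse representation: term -> count for each image's generated text.
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.length = {doc_id: sum(c.values()) for doc_id, c in self.tf.items()}
        self.avg_len = sum(self.length.values()) / len(docs)
        # Document frequency: number of images whose text contains the term.
        df = Counter(term for counts in self.tf.values() for term in counts)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, doc_id: str) -> float:
        tf = self.tf[doc_id]
        norm = self.k1 * (1 - self.b + self.b * self.length[doc_id] / self.avg_len)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def search(self, query: str, top_k: int = 3) -> List[str]:
        scored = [(self.score(query, doc_id), doc_id) for doc_id in self.tf]
        ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: -p[0])
        return [doc_id for _, doc_id in ranked[:top_k]]


# Hypothetical captions, invented for this example.
captions = {
    "img_001": "a dog catching a frisbee on a grassy park lawn",
    "img_002": "two cats sleeping on a sofa",
    "img_003": "a dog sitting next to a cat on a sofa",
}
index = BM25Index(captions)
first_pass = index.search("dog")        # one keyword
refined = index.search("dog frisbee")   # a keyword added to the query
```

Because every score is a sum over matched terms, it is easy to see which keywords caused an image to be returned, which is the interpretability advantage the summary attributes to sparse lexical representations. Adding a keyword, as in `refined`, narrows the ranking in the way the iterative-query contribution describes.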