Abstract:In this paper, we rethink sparse lexical representations for image retrieval. By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data, enabling us to utilize efficient sparse retrieval algorithms employed in natural language processing for image retrieval tasks. To assist the LLM in extracting image features, we apply data augmentation techniques for key expansion and analyze the impact with a metric for relevance between images and textual data. We empirically show the superior precision and recall performance of our image retrieval method compared to conventional vision-language model-based methods on the MS-COCO, PASCAL VOC, and NUS-WIDE datasets in a keyword-based image retrieval scenario, where keywords serve as search queries. We also demonstrate that the retrieval performance can be improved by iteratively incorporating keywords into search queries.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: in the era of the rise of multimodal large language models (M - LLMs), re - think the image retrieval method based on sparse vocabulary representation. Specifically, the author hopes to use M - LLMs that support visual prompts to extract image features and convert them into text data, so that efficient sparse retrieval algorithms in natural language processing can be used to perform image retrieval tasks. ### Problem Background Traditional image retrieval methods usually rely on the dense vector representations generated by deep neural networks (DNNs). Although these methods perform well in terms of accuracy and performance, they are insufficient in terms of interpretability and flexibility. In contrast, methods based on sparse vocabulary representation have higher interpretability, but their performance may not be as good as that of dense vector methods. In addition, existing vision - language models may compensate according to the knowledge obtained during the training process when dealing with incomplete queries, but this compensation does not necessarily meet the actual needs of users. ### Paper Goals 1. **Introduce a new image retrieval system**: This system uses M - LLMs and retrieval algorithms based on sparse vocabulary representation and can evaluate its effectiveness on various benchmark datasets. 2. **Enhance the ability of M - LLMs to extract image features**: Carry out key expansions by applying data augmentation techniques (such as cropping images) and quantitatively evaluate the improvement of retrieval performance. 3. **Explore keyword - driven image retrieval**: Different from traditional title - based image retrieval, this research focuses on the more common keyword - driven image retrieval scenarios, which are more common in practical applications. ### Main Contributions - Proposed a text - to - image retrieval system based on M - LLMs and sparse vocabulary representation and verified its effectiveness on multiple benchmark datasets. - Improve the effect of M - LLMs in extracting image features through data augmentation techniques (such as cropping images) and evaluate its performance through correlation measures (such as CLIPScore). - Empirical research shows that when combining keywords as search queries, the retrieval performance of the system is significantly improved. ### Solution Overview 1. **Feature Extraction**: Use M - LLMs to describe images and generate text data such as labels and titles. 2. **Encode into Sparse Vectors**: Encode the generated text data into sparse vectors. 3. **Image Retrieval**: Based on the sparse vector representation of query keywords, use algorithms such as BM25 to retrieve relevant images from the database. In this way, the author hopes to improve the interpretability and flexibility of the system while maintaining efficient retrieval, so as to better meet the needs of users.

Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models

Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control

Multimodal Learned Sparse Retrieval for Image Suggestion

Multi-Modal Retrieval For Large Language Model Based Speech Recognition

Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

A Framework for Image Text Retrieval Based on Large Language Model

Modeling Image Data for Effective Indexing and Retrieval in Large General Image Databases.

LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

Leveraging Large Language Models for Multimodal Search

Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond

Retrieval-Augmented Multimodal Language Modeling

Multi-modal Auto-regressive Modeling via Visual Words

Generative Multi-Modal Knowledge Retrieval with Large Language Models

From Text to Pixel: Advancing Long-Context Understanding in MLLMs

Multi-Modal Image Retrieval for Complex Queries using Small Codes

Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

Effective Deep Learning-Based Multi-Modal Retrieval

Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval

Dense Sparse Retrieval: Using Sparse Language Models for Inference Efficient Dense Retrieval

A Unified Framework for Learned Sparse Retrieval