Abstract: In this paper, we investigate the task of general conversational image retrieval on open-domain images. The objective is to search for images based on interactive conversations between humans and computers. To advance this task, we curate a dataset called ChatSearch. This dataset includes a multi-round multimodal conversational context query for each target image, thereby requiring the retrieval system to find the correct image in the database. Simultaneously, we propose a generative retrieval model named ChatSearcher, which is trained end-to-end to accept and produce interleaved image-text inputs and outputs. ChatSearcher exhibits strong capability in reasoning with multimodal context and can leverage world knowledge to yield visual retrieval results. It demonstrates superior performance on the ChatSearch dataset and also achieves competitive results on other image retrieval tasks and visual conversation tasks. We anticipate that this work will inspire further research on interactive multimodal retrieval systems. Our dataset will be available at https://github.com/joez17/ChatSearch.

What problem does this paper attempt to address?

This paper addresses general conversational image retrieval over open-domain images. Specifically, the authors aim to build a system that can search for images based on multi-round dialog between a human and a machine. These dialogs contain visual content as well as text, so the model must understand multimodal dialog and infer the user's implicit intent in order to find the target image in the database.

To solve this problem, the authors make two main contributions:

1. **Constructing a dataset**: To support and evaluate the conversational image retrieval task, the authors created a dataset named ChatSearch. Each target image in the dataset is paired with a query context drawn from a multi-round multimodal dialog, which contains both the textual and the visual content of the human-machine interaction. A retrieval model therefore has to rely on multimodal understanding and complex reasoning to obtain the information needed to retrieve the image.
2. **Proposing a model**: The authors proposed a generative retrieval model named ChatSearcher. The model is trained end-to-end; it accepts interleaved image-text inputs and produces outputs that likewise combine images and text. ChatSearcher shows strong ability to reason over multimodal context and to use world knowledge when producing visual retrieval results. It achieves superior performance on the ChatSearch dataset and competitive results on other image retrieval tasks and visual conversation tasks.

In summary, the paper aims to advance conversational image retrieval by contributing a new dataset and a new model, with particular emphasis on handling complex multimodal dialogs in an open-domain setting.
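The retrieval setting described above can be illustrated with a small sketch: a dialog context is reduced to a query vector, every database image has a vector, and the system ranks images by similarity to the query. This is not ChatSearcher itself, which the summary describes as a generative model; the vectors here are hand-written toy values standing in for the output of some encoder, the names `rank_images` and `recall_at_k` are hypothetical, and Recall@K is a commonly used retrieval metric chosen for illustration (the summary does not state which metric the paper reports).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_images(query_vec, image_vecs):
    """Return image ids ordered from most to least similar to the query."""
    return sorted(
        image_vecs,
        key=lambda image_id: cosine(query_vec, image_vecs[image_id]),
        reverse=True,
    )


def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target image appears in the top-k results."""
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(targets)


# Toy database: image id -> embedding (assumed to come from some image encoder).
image_vecs = {
    "cat": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "fox": [0.7, 0.7],
}

# Toy query vectors, one per dialog context (assumed encoder output),
# together with the target image each dialog is supposed to retrieve.
queries = [[0.9, 0.1], [0.1, 0.9]]
targets = ["cat", "fox"]

rankings = [rank_images(q, image_vecs) for q in queries]
print(rankings)
print(recall_at_k(rankings, targets, 1), recall_at_k(rankings, targets, 2))
```

In this toy setup the second dialog's target is ranked second rather than first, so it counts as a hit at K=2 but not at K=1. That is the kind of gap a stronger context-understanding model would be expected to close.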

ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval

Chatting Makes Perfect: Chat-based Image Retrieval

Conversational Image Search

ConvSearch: A Open-Domain Conversational Search Behavior Dataset

An Exploratory Study on a Reinforcement Learning Prototype for Multimodal Image Retrieval Using a Conversational Search Interface

Dialog-based Interactive Image Retrieval

Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval

DataChat: Prototyping a Conversational Agent for Dataset Search and Visualization

ChatRetriever: Adapting Large Language Models for Generalized and Robust Conversational Dense Retrieval

CONVERSER: Few-Shot Conversational Dense Retrieval with Synthetic Data Generation

Generalizing Conversational Dense Retrieval via LLM-Cognition Data Augmentation

End-to-End Conversational Search for Online Shopping with Utterance Transfer

Inclusive Design Insights from a Preliminary Image-Based Conversational Search Systems Evaluation

ConvSDG: Session Data Generation for Conversational Search

ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

IRGen: Generative Modeling for Image Retrieval

MMChat: Multi-Modal Chat Dataset on Social Media

A Survey of Multimodal Composite Editing and Retrieval

A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension

VideoChat: Chat-Centric Video Understanding

GeoChat: Grounded Large Vision-Language Model for Remote Sensing