ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval

Zijia Zhao, Longteng Guo, Tongtian Yue, Erdong Hu, Shuai Shao, Zehuan Yuan, Hua Huang, Jing Liu
2024-10-24
Abstract: In this paper, we investigate the task of general conversational image retrieval on open-domain images. The objective is to search for images based on interactive conversations between humans and computers. To advance this task, we curate a dataset called ChatSearch. This dataset includes a multi-round multimodal conversational context query for each target image, thereby requiring the retrieval system to find the accurate image from the database. Simultaneously, we propose a generative retrieval model named ChatSearcher, which is trained end-to-end to accept/produce interleaved image-text inputs/outputs. ChatSearcher exhibits strong capability in reasoning with multimodal context and can leverage world knowledge to yield visual retrieval results. It demonstrates superior performance on the ChatSearch dataset and also achieves competitive results on other image retrieval tasks and visual conversation tasks. We anticipate that this work will inspire further research on interactive multimodal retrieval systems. Our dataset will be available at https://github.com/joez17/ChatSearch.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses general conversational image retrieval over open-domain images. Specifically, the authors aim to build a system that can search for images based on multi-round dialogs between humans and machines. These dialogs contain visual content as well as text, so the model must understand multimodal dialog and infer the user's implicit intent in order to find the target image in the database. To this end, the authors make two main contributions:

1. **Constructing a dataset**: To support and evaluate conversational image retrieval, the authors curate a dataset named ChatSearch. Each target image is paired with a query context drawn from a multi-round multimodal dialog containing both text and images from the human-machine interaction. The retrieval model therefore has to rely on multimodal understanding and complex reasoning to obtain the information needed to retrieve the image.
2. **Proposing a model**: The authors propose a generative retrieval model named ChatSearcher. It is trained end-to-end, accepts interleaved image-text inputs, and produces interleaved image-text outputs. ChatSearcher shows a strong ability to reason over multimodal context and to use world knowledge when producing visual retrieval results. It achieves superior performance on the ChatSearch dataset and competitive results on other image retrieval tasks and visual conversation tasks.

In summary, the paper advances conversational image retrieval by contributing a new dataset and a new model, with a focus on handling complex multimodal dialogs in an open-domain setting.
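To make the task setup concrete, the following is a minimal sketch of what a multi-round multimodal query and a retrieval evaluation could look like. It is not ChatSearcher's method: the paper describes a generative retrieval model, whereas this sketch ranks images by plain cosine similarity over precomputed vectors. The names `Turn`, `retrieve`, and `recall_at_k`, the record fields, the dialog text, and the toy vectors are all hypothetical and are not taken from the ChatSearch dataset.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Optional


@dataclass
class Turn:
    """One dialog turn; the field names are assumptions, not the dataset schema."""
    role: str                       # "human" or "assistant"
    text: str
    image_id: Optional[str] = None  # set when the turn includes an image


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: List[float], database: Dict[str, List[float]], k: int = 1) -> List[str]:
    """Return the ids of the k database images most similar to the query vector."""
    ranked = sorted(database, key=lambda img_id: cosine(query_vec, database[img_id]), reverse=True)
    return ranked[:k]


def recall_at_k(results: List[List[str]], targets: List[str], k: int) -> float:
    """Fraction of queries whose target image appears in the top-k results."""
    hits = sum(1 for ranked, target in zip(results, targets) if target in ranked[:k])
    return hits / len(targets)


# A made-up multi-round context; a real system would encode it into query_vec.
dialog = [
    Turn("human", "What landmark is shown here?", image_id="img_query"),
    Turn("assistant", "It is the Eiffel Tower."),
    Turn("human", "Show me the city it stands in at night."),
]

# Toy image vectors standing in for the outputs of some image encoder.
database = {
    "paris_night": [0.9, 0.1, 0.0],
    "beach": [0.0, 1.0, 0.0],
    "forest": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.2, 0.0]  # placeholder for the encoded dialog context

top2 = retrieve(query_vec, database, k=2)
print(top2, recall_at_k([top2], ["paris_night"], k=1))
```

The final user turn never names the city, which illustrates why the query context has to be resolved through reasoning and world knowledge before any ranking can succeed.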