Large Language Models can Share Images, Too!

Young-Jun Lee, Dokyong Lee, Joo Won Sung, Jonghwan Hyeon, Ho-Jin Choi
2024-07-04
Abstract: This paper explores the image-sharing capability of Large Language Models (LLMs), such as GPT-4 and LLaMA 2, in a zero-shot setting. To facilitate a comprehensive evaluation of LLMs, we introduce the PhotoChat++ dataset, which includes enriched annotations (i.e., intent, triggering sentence, image description, and salient information). Furthermore, we present the gradient-free and extensible Decide, Describe, and Retrieve (DribeR) framework. With extensive experiments, we unlock the image-sharing capability of DribeR equipped with LLMs in zero-shot prompting, with ChatGPT achieving the best performance. Our findings also reveal the emergent image-sharing ability in LLMs under zero-shot conditions, validating the effectiveness of DribeR. We use this framework to demonstrate its practicality and effectiveness in two real-world scenarios: (1) human-bot interaction and (2) dataset augmentation. To the best of our knowledge, this is the first study to assess the image-sharing ability of various LLMs in a zero-shot setting. We make our source code and dataset publicly available at https://github.com/passing2961/DribeR.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper evaluates the image-sharing ability of large language models (LLMs) in a zero-shot setting. Specifically, the authors explore how existing LLMs can exhibit image-sharing behavior without additional training. The research focuses on two aspects:

1. **Image-sharing decision**: Determine whether an image should be shared given the conversation history.
2. **Image description and retrieval**: Generate an image description relevant to the conversation and retrieve the corresponding image based on that description.

### Main contributions

1. **The DribeR framework**: A gradient-free, extensible, and general-purpose framework for evaluating the image-sharing ability of LLMs in a zero-shot setting. It consists of three stages:
   - **Decide**: Decide whether an image should be shared, based on the conversation history and intent.
   - **Describe**: Generate an image description relevant to the conversation.
   - **Retrieve**: Retrieve the corresponding image based on the generated description.
2. **The PhotoChat++ dataset**: An extended multi-modal conversation dataset with enriched annotations: intent, triggering sentence, image description, and salient information. These annotations support a more comprehensive evaluation of the image-sharing ability of LLMs.
3. **Experimental verification**: Extensive experiments verify the effectiveness and generality of DribeR. ChatGPT performs best in the zero-shot setting, and the framework effectively unlocks the image-sharing ability of LLMs.
4. **Practical applications**: The paper demonstrates DribeR in two practical scenarios:
   - **Human-bot interaction**: DribeR is evaluated in real-world human-bot interaction on the VisDial dataset, where it significantly outperforms the recent ChatIR system.
   - **Dataset augmentation**: DribeR is used to augment the PhotoChat dataset, which improves the generalization performance of the model.

### Problems solved

- **Zero-shot image sharing**: Without additional training, evaluate whether LLMs can decide when to share an image given the conversation history and generate the corresponding image description.
- **Complex conversation understanding**: Address the limited understanding of conversation context that existing methods show when generating image descriptions.
- **Multi-modal tasks**: Improve the evaluation and understanding of multi-modal tasks by introducing enriched annotations.

Overall, this paper systematically evaluates the image-sharing ability of LLMs in a zero-shot setting through the DribeR framework and the PhotoChat++ dataset, and shows their potential in practical applications.
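
The three stages summarized above (decide, describe, retrieve) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `Image` type, and the keyword-based stubs are all hypothetical. In the paper the first two stages are zero-shot LLM prompts; here they are passed in as plain callables, and a toy bag-of-words cosine score stands in for whatever text-to-image retriever is actually used.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Optional, Sequence


@dataclass
class Image:
    image_id: str
    caption: str  # assumption: a text caption stands in for the image content


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors (toy retrieval score)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def share_image(
    dialogue: Sequence[str],
    decide: Callable[[Sequence[str]], bool],
    describe: Callable[[Sequence[str]], str],
    pool: Sequence[Image],
) -> Optional[Image]:
    """Run the three stages in order; return None when no image should be shared."""
    # Stage 1 (Decide): should an image be shared at this turn?
    if not decide(dialogue):
        return None
    # Stage 2 (Describe): produce a textual description of the image to share.
    description = describe(dialogue)
    # Stage 3 (Retrieve): pick the candidate that best matches the description.
    if not pool:
        return None
    return max(pool, key=lambda img: bow_cosine(description, img.caption))


# Hypothetical stand-ins for the zero-shot LLM prompts used in stages 1 and 2.
def stub_decide(dialogue: Sequence[str]) -> bool:
    return any("photo" in turn.lower() or "picture" in turn.lower() for turn in dialogue)


def stub_describe(dialogue: Sequence[str]) -> str:
    return dialogue[-1]  # naive: reuse the last utterance as the description


pool = [
    Image("img_1", "a dog running on the beach"),
    Image("img_2", "a plate of pasta on a table"),
]
dialogue = ["We went to the coast yesterday.", "Here is a photo of my dog on the beach"]
result = share_image(dialogue, stub_decide, stub_describe, pool)
print(result.image_id if result else "no image shared")
```

Because no stage involves training, any of the three callables can be swapped for a different LLM prompt or retriever without touching the others, which is the sense in which the framework is described as gradient-free and extensible.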