Abstract: This paper explores the image-sharing capability of Large Language Models (LLMs), such as GPT-4 and LLaMA 2, in a zero-shot setting. To facilitate a comprehensive evaluation of LLMs, we introduce the PhotoChat++ dataset, which includes enriched annotations (i.e., intent, triggering sentence, image description, and salient information). Furthermore, we present the gradient-free and extensible Decide, Describe, and Retrieve (DribeR) framework. With extensive experiments, we unlock the image-sharing capability of DribeR equipped with LLMs in zero-shot prompting, with ChatGPT achieving the best performance. Our findings also reveal the emergent image-sharing ability in LLMs under zero-shot conditions, validating the effectiveness of DribeR. We use this framework to demonstrate its practicality and effectiveness in two real-world scenarios: (1) human-bot interaction and (2) dataset augmentation. To the best of our knowledge, this is the first study to assess the image-sharing ability of various LLMs in a zero-shot setting. We make our source code and dataset publicly available at https://github.com/passing2961/DribeR.

What problem does this paper attempt to address?

This paper evaluates the image-sharing ability of large language models (LLMs) in a zero-shot setting. Specifically, the authors explore how existing LLMs can exhibit image-sharing behavior without additional training. The study focuses on two aspects:

1. **Image-sharing decision**: Determine whether an image should be shared, given the conversation history.
2. **Image description and retrieval**: Generate an image description relevant to the conversation and retrieve the corresponding image based on that description.

### Main contributions

1. **The DribeR framework**: A gradient-free, extensible, general-purpose framework for evaluating the image-sharing ability of LLMs in a zero-shot setting. It consists of three stages:
   - **Decide**: Decide whether an image should be shared, based on the conversation history and intent.
   - **Describe**: Generate an image description relevant to the conversation.
   - **Retrieve**: Retrieve the corresponding image based on the generated description.
2. **The PhotoChat++ dataset**: An extended multimodal conversation dataset with enriched annotations: intent, triggering sentence, image description, and salient information. These annotations support a more comprehensive evaluation of the image-sharing ability of LLMs.
3. **Experimental verification**: Extensive experiments validate the effectiveness of DribeR. The results show that DribeR unlocks the image-sharing ability of LLMs under zero-shot prompting, with ChatGPT achieving the best performance.
4. **Practical applications**: The authors demonstrate DribeR in two real-world scenarios:
   - **Human-bot interaction**: DribeR is evaluated in human-bot interaction on the VisDial dataset, where the results show that it significantly outperforms the recent ChatIR system.
   - **Dataset augmentation**: DribeR is used to augment the PhotoChat dataset, which improves the generalization performance of the model.

### Problems solved

- **Zero-shot image sharing**: Without additional training, evaluate whether LLMs can decide when to share an image based on the conversation history and generate the corresponding image description.
- **Complex conversation understanding**: Address the limited understanding of conversation context that existing methods show when generating image descriptions.
- **Multimodal tasks**: Improve the evaluation and understanding of multimodal tasks by introducing enriched annotations.

Overall, the paper systematically evaluates the image-sharing ability of LLMs in a zero-shot setting through the DribeR framework and the PhotoChat++ dataset, and shows the framework's potential in practical applications.
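The three stages described above (decide, describe, retrieve) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompts are invented for the example, the `LLM` type and `stub_llm` are hypothetical placeholders for a real model call, and the retrieval stage uses a simple bag-of-words cosine similarity over image captions as a stand-in for whatever retriever the paper actually uses.

```python
from __future__ import annotations

import math
from collections import Counter
from typing import Callable, Optional

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def decide(llm: LLM, dialogue: list[str]) -> bool:
    """Stage 1: ask the model whether an image should be shared next."""
    prompt = (
        "Dialogue:\n" + "\n".join(dialogue)
        + "\nShould the next turn share an image? Answer yes or no."
    )
    return llm(prompt).strip().lower().startswith("yes")


def describe(llm: LLM, dialogue: list[str]) -> str:
    """Stage 2: ask the model to describe the image that fits the dialogue."""
    prompt = (
        "Dialogue:\n" + "\n".join(dialogue)
        + "\nDescribe the image to share in one sentence."
    )
    return llm(prompt).strip()


def _bow(text: str) -> Counter:
    """Bag-of-words vector (assumed stand-in for a learned text encoder)."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(description: str, captions: dict[str, str]) -> str:
    """Stage 3: return the id of the image whose caption best matches."""
    query = _bow(description)
    return max(captions, key=lambda image_id: _cosine(query, _bow(captions[image_id])))


def driber(llm: LLM, dialogue: list[str], captions: dict[str, str]) -> Optional[str]:
    """Run decide -> describe -> retrieve; return None if no image is shared."""
    if not decide(llm, dialogue):
        return None
    return retrieve(describe(llm, dialogue), captions)


def stub_llm(prompt: str) -> str:
    # Canned replies so the sketch runs without any model or network access.
    if "yes or no" in prompt:
        return "Yes"
    return "A golden retriever playing on the beach"


dialogue = [
    "A: We took Max to the coast yesterday.",
    "B: Oh nice, how did he like it?",
]
captions = {
    "img_1": "a plate of pasta on a table",
    "img_2": "a golden retriever running on a beach",
}
print(driber(stub_llm, dialogue, captions))
```

Because no stage updates model weights, swapping `stub_llm` for a call to a different model changes nothing else in the pipeline, which is the sense in which a prompting-only design of this kind is gradient-free and extensible.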

Large Language Models can Share Images, Too!

Large Language Models are Good Prompt Learners for Low-Shot Image Classification

LLMGA: Multimodal Large Language Model based Generation Assistant

From Images to Textual Prompts: Zero-Shot Visual Question Answering with Frozen Large Language Models

Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis

Supervised Knowledge Makes Large Language Models Better In-context Learners

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

LLM as an Art Director (LaDi): Using LLMs to improve Text-to-Media Generators

Large Language Models: A Survey

Rethinking VLMs and LLMs for Image Classification

LMEye: An Interactive Perception Network for Large Language Models

Language-Image Models with 3D Understanding

ChatGPT Alternative Solutions: Large Language Models Survey

Elucidating the design space of language models for image generation

A Survey on Multimodal Large Language Models