POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models

Jianben He, Xingbo Wang, Shiyi Liu, Guande Wu, Claudio Silva, Huamin Qu
2024-10-01
Abstract: Large language models (LLMs) have exhibited impressive abilities for multimodal content comprehension and reasoning with proper prompting in zero- or few-shot settings. Despite the proliferation of interactive systems developed to support prompt engineering for LLMs across various tasks, most have primarily focused on textual or visual inputs, thus neglecting the complex interplay between modalities within multimodal inputs. This oversight hinders the development of effective prompts that guide model multimodal reasoning processes by fully exploiting the rich context provided by multiple modalities. In this paper, we present POEM, a visual analytics system to facilitate efficient prompt engineering for enhancing the multimodal reasoning performance of LLMs. The system enables users to explore the interaction patterns across modalities at varying levels of detail for a comprehensive understanding of the multimodal knowledge elicited by various prompts. Through diverse recommendations of demonstration examples and instructional principles, POEM supports users in iteratively crafting and refining prompts to better align and enhance model knowledge with human insights. The effectiveness and efficiency of our system are validated through two case studies and interviews with experts.
Human-Computer Interaction, Artificial Intelligence
What problem does this paper attempt to address?
Most current interactive prompt engineering systems focus on either textual or visual input and ignore the complex interactions between modalities in multimodal input. This gap makes it hard to design prompts that exploit the rich context of multiple modalities to guide and improve the multimodal reasoning of large language models (LLMs). Specifically, the paper identifies the following problems:

1. **Multimodal reasoning performance is sensitive to prompt changes**: Inappropriate or poorly designed prompts may cause the model to generate incorrect knowledge, resulting in biased and unreliable reasoning.
2. **Manual inspection of each instance is time-consuming and does not give a comprehensive picture**: Users need to systematically inspect multimodal reasoning performance under different prompts, but checking instances one by one is slow and does not yield a global understanding.
3. **The reasoning process of multimodal LLMs is difficult to explain**: The complex interactions between modalities, combined with the unstructured, generative nature of free-text reasoning, make the reasoning process of LLMs particularly hard to interpret.
4. **Effective tools for designing and optimizing prompts are lacking**: Users need a comprehensive tool that helps them efficiently design and refine prompts to guide and improve the model's multimodal reasoning performance.
To solve these problems, the paper proposes POEM (Prompt Optimization for Enhancing Multimodal Reasoning), a visual analytics system designed to simplify the prompt engineering process in the following ways:

- **Multi-level analysis**: Provides analysis from the global level down to the local level, helping users understand how different prompts affect multimodal reasoning performance.
- **Diverse support**: Provides diverse recommendations, including the selection of demonstration examples and the generation of instructional principles, to help users iterate on and optimize prompts.
- **Interactive feedback**: Through human-machine collaboration, users can evaluate and compare the effects of different prompts in real time and adjust them according to the feedback.

Through these methods, POEM aims to make prompt design and optimization more efficient and accurate, so that prompts better guide and improve the multimodal reasoning performance of LLMs.
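
The global-to-local comparison of prompts described above can be illustrated with a small sketch. This is not POEM's implementation: the function names `compare_prompts` and `disagreements`, the prompt names, and the toy labels are all hypothetical, and exact-match accuracy is assumed as the metric purely for illustration.

```python
def compare_prompts(predictions, labels):
    """Summarize each prompt variant at a global and a local level.

    predictions: dict mapping a (hypothetical) prompt name to its list of
                 predicted labels, one per instance.
    labels:      list of gold labels, one per instance.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    summary = {}
    for name, preds in predictions.items():
        if len(preds) != len(labels):
            raise ValueError(f"{name}: expected {len(labels)} predictions")
        correct = [p == g for p, g in zip(preds, labels)]
        summary[name] = {
            # Global view: overall exact-match accuracy of this prompt.
            "accuracy": sum(correct) / len(labels),
            # Local view: indices of instances this prompt got wrong.
            "errors": [i for i, ok in enumerate(correct) if not ok],
        }
    return summary


def disagreements(predictions):
    """Return instance indices where the prompt variants disagree."""
    names = list(predictions)
    n = len(predictions[names[0]])
    return [
        i for i in range(n)
        if len({predictions[name][i] for name in names}) > 1
    ]


# Toy data, assumed for illustration only.
labels = ["pos", "neg", "neu", "pos"]
predictions = {
    "baseline": ["pos", "pos", "neu", "neg"],
    "with_principles": ["pos", "neg", "neu", "neg"],
}
summary = compare_prompts(predictions, labels)
changed = disagreements(predictions)
```

The instances in `changed` are where a prompt revision altered the model's answer, so they are natural candidates for closer, instance-level inspection.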