Abstract: Large language models (LLMs) have exhibited impressive abilities for multimodal content comprehension and reasoning with proper prompting in zero- or few-shot settings. Despite the proliferation of interactive systems developed to support prompt engineering for LLMs across various tasks, most have primarily focused on textual or visual inputs, thus neglecting the complex interplay between modalities within multimodal inputs. This oversight hinders the development of effective prompts that guide model multimodal reasoning processes by fully exploiting the rich context provided by multiple modalities. In this paper, we present POEM, a visual analytics system to facilitate efficient prompt engineering for enhancing the multimodal reasoning performance of LLMs. The system enables users to explore the interaction patterns across modalities at varying levels of detail for a comprehensive understanding of the multimodal knowledge elicited by various prompts. Through diverse recommendations of demonstration examples and instructional principles, POEM supports users in iteratively crafting and refining prompts to better align and enhance model knowledge with human insights. The effectiveness and efficiency of our system are validated through two case studies and interviews with experts.

What problem does this paper attempt to address?

The main problem this paper addresses is that most interactive prompt engineering systems focus on textual or visual input alone and ignore the complex interactions between modalities in multimodal input. This neglect hinders the design of effective prompts that fully use the rich context provided by multiple modalities to guide and improve the multimodal reasoning performance of large language models (LLMs). Specifically, the paper points out the following problems:

1. **Multimodal reasoning performance is sensitive to prompt changes**: Inappropriate or poorly designed prompts may cause the model to generate wrong knowledge, resulting in biased and unreliable reasoning results.
2. **Manual inspection of each instance is time-consuming and cannot provide a comprehensive understanding**: Users need to systematically inspect multimodal reasoning performance under different prompts, but checking instances one by one is slow and does not yield a global view.
3. **It is difficult to explain the reasoning process of multimodal LLMs**: In the multimodal context, the complex interactions between modalities and the unstructured, generative nature of free-text reasoning make the reasoning process of LLMs particularly challenging to explain.
4. **Lack of effective tools to help users design and optimize prompts**: Users need a comprehensive tool that helps them efficiently design and optimize prompts to guide and improve the model's multimodal reasoning performance.

To solve these problems, the paper proposes POEM (Prompt Optimization for Enhancing Multimodal Reasoning), a visual analytics system designed to streamline the prompt engineering process in the following ways:

- **Multi-level analysis**: Provide analysis at multiple levels, from global to local, to help users comprehensively understand the impact of different prompts on multimodal reasoning performance.
- **Diverse support**: Provide diverse recommendations, including the selection of demonstration examples and the generation of instructional principles, to help users iterate on and optimize prompts.
- **Interactive feedback**: Through human-machine collaboration, users can evaluate and compare the effects of different prompts in real time and adjust them according to the feedback.

Through these methods, POEM aims to improve the efficiency and accuracy with which users design and optimize prompts, so as to better guide and improve the multimodal reasoning performance of LLMs.
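The loop described above (assemble a prompt from instructional principles and demonstration examples, score it globally, then drill down to the individual instances whose outcome changed) can be illustrated with a minimal sketch. This is not POEM's implementation: `Example`, `build_prompt`, `evaluate`, and `toy_model` are hypothetical names introduced here, the data is invented for illustration, and `toy_model` is a deterministic stand-in for a real LLM call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One labeled multimodal instance (hypothetical schema)."""
    text: str        # e.g. a transcript of what was said
    visual_cue: str  # e.g. a textual description of the visual signal
    label: str       # ground-truth answer


def build_prompt(principles, demos, query):
    """Assemble a few-shot prompt: principles, demonstrations, then the query."""
    header = "Follow these principles:\n" + "\n".join(f"- {p}" for p in principles)
    blocks = [header]
    for d in demos:
        blocks.append(f"Text: {d.text}\nVisual: {d.visual_cue}\nAnswer: {d.label}")
    blocks.append(f"Text: {query.text}\nVisual: {query.visual_cue}\nAnswer:")
    return "\n\n".join(blocks)


def toy_model(prompt):
    """Stand-in for an LLM: it uses the visual cue only when a principle asks for it."""
    blocks = prompt.split("\n\n")
    header, query_block = blocks[0].lower(), blocks[-1].lower()
    if "visual cue" in header:
        return "positive" if "smiling" in query_block else "negative"
    return "positive" if "great" in query_block else "negative"


def evaluate(principles, demos, dataset, model):
    """Return global accuracy and the per-instance correctness list."""
    hits = [model(build_prompt(principles, demos, ex)) == ex.label for ex in dataset]
    accuracy = sum(hits) / len(hits) if hits else 0.0
    return accuracy, hits


# Invented toy data; the second instance is sarcastic, so text and visual cue conflict.
demos = [Example("I love it", "smiling", "positive")]
dataset = [
    Example("What a great day", "smiling", "positive"),
    Example("Great, another delay", "frowning", "negative"),
    Example("This is terrible", "frowning", "negative"),
]

baseline = ["Answer with positive or negative."]
refined = baseline + ["Weigh the visual cue when it conflicts with the text."]

base_acc, base_hits = evaluate(baseline, demos, dataset, toy_model)
refined_acc, refined_hits = evaluate(refined, demos, dataset, toy_model)

# Local view: indices of instances whose outcome changed between the two prompts.
changed = [i for i, (a, b) in enumerate(zip(base_hits, refined_hits)) if a != b]
print(f"baseline={base_acc:.2f} refined={refined_acc:.2f} changed={changed}")
```

In practice `toy_model` would be replaced by a call to an actual multimodal model, and the list of changed instances is the kind of local evidence a user might inspect before deciding whether to keep a new principle or demonstration.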

POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models
