Abstract: As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size or the number of few-shot examples, or when performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which calls into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis, we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.
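To make the idea of reporting a performance interval across formats concrete, the following is a minimal Python sketch. It is not the FormatSpread algorithm itself, whose sampling procedure the abstract does not specify. It simply enumerates a small, hypothetical grid of meaning-preserving formats (separator, spacing between fields, and casing of field names) and reports the lowest and highest accuracy observed. The names `build_formats`, `format_spread`, and `predict` are assumptions introduced for illustration; `predict` stands in for any black-box model call that maps a prompt string to an answer string.

```python
from itertools import product
from typing import Callable, List, Tuple

# Hypothetical axes of meaning-preserving variation (illustrative only).
SEPARATORS = [": ", " - ", ":\n"]
FIELD_SPACERS = ["\n", " ", "\n\n"]
CASINGS = [str.title, str.upper, str.lower]

Format = Callable[[str], str]


def build_formats() -> List[Format]:
    """Enumerate every combination of separator, spacer, and casing."""
    formats: List[Format] = []
    for sep, spacer, casing in product(SEPARATORS, FIELD_SPACERS, CASINGS):
        # Default arguments bind the current loop values to each closure.
        def fmt(question: str, sep=sep, spacer=spacer, casing=casing) -> str:
            return (
                f"{casing('question')}{sep}{question}"
                f"{spacer}{casing('answer')}{sep}"
            )
        formats.append(fmt)
    return formats


def format_spread(
    formats: List[Format],
    examples: List[Tuple[str, str]],
    predict: Callable[[str], str],
) -> Tuple[float, float]:
    """Return (lowest, highest) accuracy over the given prompt formats.

    `predict` is treated as a black box, so no model weights are needed.
    """
    accuracies = []
    for fmt in formats:
        correct = sum(predict(fmt(q)) == a for q, a in examples)
        accuracies.append(correct / len(examples))
    return min(accuracies), max(accuracies)
```

Given a real model wrapped as `predict`, the gap between the two returned values is the kind of range the abstract recommends reporting in place of a single-format score. Exhaustive enumeration only works for a tiny grid like this one; a larger format space would need a sampling strategy.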

Language Model Inversion

Extracting Prompts by Inverting LLM Outputs

Reverse Prompt Engineering

Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations

Demystifying Prompts in Language Models via Perplexity Estimation

Controllable Generation from Pre-trained Language Models via Inverse Prompting

Effective Prompt Extraction from Language Models

KnowledgeVIS: Interpreting Language Models by Comparing Fill-in-the-Blank Prompts

Set-Based Prompting: Provably Solving the Language Model Order Dependency Problem

Language Models in the Loop: Incorporating Prompting into Weak Supervision

The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance

Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution

Language Model Crossover: Variation through Few-Shot Prompting

Unveiling and Manipulating Prompt Influence in Large Language Models

From Language Models over Tokens to Language Models over Characters

Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs

Advancing Prompt Recovery in NLP: A Deep Dive into the Integration of Gemma-2b-it and Phi2 Models

Ask Again, Then Fail: Large Language Models' Vacillations in Judgment

Language Models as Black-Box Optimizers for Vision-Language Models

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

On the Proper Treatment of Tokenization in Psycholinguistics