Abstract: The ability of large language models (LLMs) to "learn in context" based on the provided prompt has led to an explosive growth in their use, culminating in the proliferation of AI assistants such as ChatGPT, Claude, and Bard. These AI assistants are known to be robust to minor prompt modifications, mostly due to alignment techniques that use human feedback. In contrast, the underlying pre-trained LLMs they use as a backbone are known to be brittle in this respect. Building high-quality backbone models remains a core challenge, and a common approach to assessing their quality is to conduct few-shot evaluation. Such evaluation is notorious for being highly sensitive to minor prompt modifications, as well as the choice of specific in-context examples. Prior work has examined how modifying different elements of the prompt can affect model performance. However, these earlier studies tended to concentrate on a limited number of specific prompt attributes and often produced contradictory results. Additionally, previous research either focused on models with fewer than 15 billion parameters or exclusively examined black-box models like GPT-3 or PaLM, making replication challenging. In the present study, we decompose the entire prompt into four components: task description, demonstration inputs, labels, and inline instructions provided for each demonstration. We investigate the effects of structural and semantic corruptions of these elements on model performance. We study models ranging from 1.5B to 70B in size, using ten datasets covering classification and generation tasks. We find that repeating text within the prompt boosts model performance, and bigger models ($\geq$30B) are more sensitive to the semantics of the prompt. Finally, we observe that adding task and inline instructions to the demonstrations enhances model performance even when the instructions are semantically corrupted.
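
To make the four-component decomposition concrete, the following is a minimal sketch of how such a prompt might be assembled and how one corruption might be applied. The abstract does not specify the prompt template or the exact corruption procedures, so the layout, the names `Demonstration`, `build_prompt`, and `shuffle_words`, and the choice of word shuffling as an example corruption are all illustrative assumptions rather than the authors' implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Demonstration:
    text: str   # demonstration input
    label: str  # demonstration label


def shuffle_words(text: str, rng: random.Random) -> str:
    """Hypothetical corruption: randomly permute the words of a string."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def build_prompt(task_description, inline_instruction, demos, query, corrupt=None):
    """Assemble a few-shot prompt from the four components.

    The template is an assumption made for illustration. `corrupt` is an
    optional function applied only to the task description and the inline
    instruction; demonstration inputs and labels are left intact.
    """
    if corrupt is not None:
        task_description = corrupt(task_description)
        inline_instruction = corrupt(inline_instruction)
    blocks = [task_description]
    for demo in demos:
        # Each demonstration: input, then inline instruction followed by the label.
        blocks.append(f"{demo.text}\n{inline_instruction} {demo.label}")
    # The query repeats the inline instruction and leaves the label open.
    blocks.append(f"{query}\n{inline_instruction}")
    return "\n\n".join(blocks)
```

Calling `build_prompt` once without `corrupt` and once with `corrupt=lambda s: shuffle_words(s, random.Random(0))` yields a clean and a corrupted variant of the same prompt, which could then be scored against the same model to compare performance.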

Can Large Language Models Truly Understand Prompts? A Case Study with Negated Prompts

The language of prompting: What linguistic properties make a prompt successful?

Comparative Analysis of Prompt Strategies for Large Language Models: Single-Task vs. Multitask Prompts

NegativePrompt: Leveraging Psychology for Large Language Models Enhancement via Negative Emotional Stimuli

Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance

Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models

On the Worst Prompt Performance of Large Language Models

Large Language Models are Good Multi-lingual Learners : When LLMs Meet Cross-lingual Prompts

Automatic Prompt Selection for Large Language Models

Prompt2Model: Generating Deployable Models from Natural Language Instructions

Demystifying Prompts in Language Models via Perplexity Estimation

Sensitivity and Robustness of Large Language Models to Prompt in Japanese

Large Language Models are Contrastive Reasoners

The art of prompts' formulation: limitations, potential, and practical examples in large language models

Do Prompt-Based Models Really Understand the Meaning of their Prompts?

Towards Goal-oriented Prompt Engineering for Large Language Models: A Survey

Do Prompts Solve NLP Tasks Using Natural Language?

Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models

Deconstructing In-Context Learning: Understanding Prompts via Corruption

PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Helping Language Models Learn More: Multi-dimensional Task Prompt for Few-shot Tuning