Abstract: Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt telling the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in the prompt design. This calls into question the replicability of classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad-hoc and task specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package PromptStability for its estimation. Using six different datasets and twelve outcomes, we classify >150k rows of data to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best practice recommendations for applied researchers.

What problem does this paper attempt to address?

The main problem this paper addresses is **the stability of language model (LM) outputs in text-annotation tasks**. Specifically, it focuses on how to evaluate and improve the consistency of a model's classifications across different prompt designs, which the authors call **prompt stability**. The authors point out that although zero-shot and few-shot approaches can classify data efficiently from simple prompts, the repeatability of the results may be affected by minor changes in prompt design. This undermines the reproducibility of research and can lead to inconsistent results when different models or researchers apply the same method.

To address this challenge, the paper proposes a general framework for diagnosing prompt stability and introduces a new metric, the **Prompt Stability Score (PSS)**. The framework assesses the consistency of model outputs by automatically generating semantically similar prompts and classifying the same dataset multiple times. The approach is analogous to traditional inter-coder and intra-coder reliability assessment for human coders, and aims to give researchers a systematic way to test and improve the stability and reliability of LMs in text-annotation tasks.

### Main Contributions

1. **Proposing the Prompt Stability Score (PSS)**: A new metric for evaluating the classification consistency of LMs under different prompt designs.
2. **Developing the Python package PromptStability**: A toolkit that helps researchers estimate the prompt stability score.
3. **Empirical Analysis**: The authors classify more than 150,000 rows of data across six datasets and twelve outcomes to diagnose when prompt stability is low and to demonstrate the functionality of the package.

### Method Overview

1. **Intra-prompt Stability**:
   - Classify the same dataset multiple times using the original prompt.
   - Calculate Krippendorff's Alpha (KA) across the repeated classifications as the intra-prompt stability score (intra-PSS).
   - Evaluate the stability of the model under the same prompt by accumulating scores over multiple iterations.
2. **Inter-prompt Stability**:
   - Use the PEGASUS model to generate a series of semantically similar prompts, controlling the diversity of the prompts by adjusting the temperature parameter.
   - Classify the data with each generated prompt multiple times and calculate the inter-prompt stability score (inter-PSS).
   - Evaluate the performance of the model under different prompt designs by comparing the stability scores of prompts generated at different temperatures.

### Results

- **Intra-prompt Stability**: The average intra-PSS for most datasets is above 0.8, indicating that the model's classifications under the same prompt are highly stable.
- **Inter-prompt Stability**: As the semantic differences between prompts increase, the variance of the inter-PSS also gradually increases, especially for prompts generated at higher temperatures.

### Conclusion

By proposing the Prompt Stability Score (PSS) and developing the corresponding toolkit, the paper gives researchers a systematic method for evaluating and improving the stability and reliability of LMs in text-annotation tasks. This matters both for the reproducibility of research and for model performance in practical applications.
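The scoring step in the Method Overview can be illustrated with a minimal sketch. It assumes nominal (categorical) labels with no missing values and implements Krippendorff's Alpha directly from its coincidence-matrix definition. The function names `krippendorff_alpha_nominal` and `stability_score` and the toy labels are hypothetical, chosen for illustration; this is not the PromptStability package's API, and the package may compute the score differently.

```python
import math
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal labels.

    `units` is a list of lists: each inner list holds every label that one
    data row received (one label per run or per prompt variant).
    """
    # Build the coincidence matrix: each ordered pair of labels given to the
    # same row by different runs contributes 1 / (m - 1), m = labels on that row.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a row labelled once carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label value.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return math.nan

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        # Only one label value was ever used, so Alpha is undefined.
        return math.nan
    return 1.0 - observed / expected


def stability_score(runs):
    """Hypothetical helper: `runs` is a list of label lists, one per run
    (intra-prompt) or one per paraphrased prompt (inter-prompt)."""
    units = [list(row_labels) for row_labels in zip(*runs)]
    return krippendorff_alpha_nominal(units)


# Toy data (invented): three runs of the same prompt over six rows.
same_prompt_runs = [
    ["pos", "neg", "pos", "neg", "pos", "neg"],
    ["pos", "neg", "pos", "neg", "pos", "neg"],
    ["pos", "neg", "pos", "pos", "pos", "neg"],
]
print(f"intra-prompt score: {stability_score(same_prompt_runs):.3f}")
```

The same helper covers the inter-prompt case if each inner list holds the labels produced by a different paraphrased prompt rather than by a repeated run of the original prompt. A score near 1 means the runs agree almost everywhere, while a score near 0 means agreement is no better than chance.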

Prompt Stability Scoring for Text Annotation with Large Language Models

Prompt Selection Matters: Enhancing Text Annotations for Social Sciences with Large Language Models

PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation

POSIX: A Prompt Sensitivity Index For Large Language Models

Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways

PromptAid: Prompt Exploration, Perturbation, Testing and Iteration using Visual Analytics for Large Language Models

PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning

ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

PromptBench: A Unified Library for Evaluation of Large Language Models

PRompt Optimization in Multi-Step Tasks (PROMST): Integrating Human Feedback and Heuristic-based Sampling

StablePrompt: Automatic Prompt Tuning using Reinforcement Learning for Large Language Models

What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering

PACE: Improving Prompt with Actor-Critic Editing for Large Language Model

PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Set-Based Prompting: Provably Solving the Language Model Order Dependency Problem

AXCEL: Automated eXplainable Consistency Evaluation using LLMs

Which is better? Exploring Prompting Strategy For LLM-based Metrics

Supervisory Prompt Training

Robust Prompt Optimization for Large Language Models Against Distribution Shifts