Prompt Stability Scoring for Text Annotation with Large Language Models

Christopher Barrie, Elli Palaiologou, Petter Törnberg
2024-07-02
Abstract: Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt telling the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in the prompt design. This calls into question the replicability of classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad hoc and task-specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package PromptStability for its estimation. Using six different datasets and twelve outcomes, we classify >150k rows of data to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best practice recommendations for applied researchers.
Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is **the stability of language model (LM) outputs in text-annotation tasks**. Specifically, it asks how to evaluate and improve the consistency of a model's classifications when the prompt design changes, a property the authors call **prompt stability**. The authors point out that although zero-shot and few-shot approaches can classify data efficiently with simple prompts, the repeatability of those classifications may be affected by minor changes in prompt design. This undermines the reproducibility of research and may lead to inconsistent results when different models or researchers apply the same method.

To address this challenge, the paper proposes a general framework for diagnosing prompt stability and introduces a new metric, the **Prompt Stability Score (PSS)**. The framework assesses the consistency of model outputs by automatically generating semantically similar prompts and classifying the same dataset multiple times. The approach is analogous to traditional inter-coder and intra-coder reliability assessment for human coders, and aims to give researchers a systematic way to test and improve the stability and reliability of language models in text-annotation tasks.

### Main Contributions

1. **Proposing the Prompt Stability Score (PSS)**: A new metric for evaluating the classification consistency of language models under different prompt designs.
2. **Developing the Python package PromptStability**: A toolkit that helps researchers estimate the Prompt Stability Score.
3. **Empirical Analysis**: The authors classify more than 150,000 rows of data across six datasets and twelve outcomes to diagnose when prompt stability is low and to demonstrate the functionality of the package.

### Method Overview

1. **Intra-prompt Stability**:
   - Classify the same dataset multiple times using the original prompt.
   - Calculate Krippendorff's Alpha (KA) across the repeated classifications as the intra-prompt stability score (intra-PSS).
   - Evaluate the stability of the model under the same prompt by accumulating scores over multiple iterations.
2. **Inter-prompt Stability**:
   - Use the PEGASUS model to generate a series of semantically similar prompts, controlling the diversity of the prompts by adjusting the temperature parameter.
   - Classify the same dataset with each generated prompt and calculate the inter-prompt stability score (inter-PSS).
   - Evaluate the performance of the model under different prompt designs by comparing the stability scores of prompts generated at different temperatures.

### Results

- **Intra-prompt Stability**: The average intra-prompt stability score (intra-PSS) for most datasets is above 0.8, indicating that the model's classifications are highly stable when the prompt is held fixed.
- **Inter-prompt Stability**: As the semantic differences between prompts increase, the variance of the inter-prompt stability score also increases, especially for prompts generated at higher temperatures.

### Conclusion

By proposing the Prompt Stability Score (PSS) and developing the corresponding toolkit, the paper gives researchers a systematic method for evaluating and improving the stability and reliability of language models in text-annotation tasks. This matters for ensuring the reproducibility of research and for improving model performance in practical applications.
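
To make the scoring step in the method overview concrete, the following is a minimal sketch of Krippendorff's Alpha for nominal labels, treating each repeated run (or each paraphrased prompt) as one "coder". It is an illustration under stated assumptions, not the PromptStability package's implementation: the function name `krippendorff_alpha_nominal`, the list-of-lists input layout, and the toy labels are all hypothetical, and the paper's exact aggregation across iterations may differ.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal data.

    `units` is a list of lists: each inner list holds the labels one text
    received across repeated runs or prompt variants (None = missing).
    This layout is an assumption made for the sketch.
    """
    # Build the coincidence matrix: every ordered pair of labels within a
    # unit contributes 1 / (m - 1), where m is the number of labels in it.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and the total number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one unit with two or more labels")

    # Observed and expected disagreement (off-diagonal mass only).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)

    if expected == 0:
        # Only one category was ever used, so alpha is undefined (0 / 0).
        return float("nan")
    return 1.0 - observed / expected


# Toy example: four texts, each classified twice with the same prompt.
runs = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]
print(round(krippendorff_alpha_nominal(runs), 3))  # 0.533
```

For an intra-prompt score, the inner lists would hold the labels from repeated runs of the original prompt; for an inter-prompt score, they would hold the labels produced by each paraphrased prompt. Returning `nan` when only one category appears is a design choice of this sketch, since the statistic is undefined in that case.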