Abstract: The standard way to study Large Language Models (LLMs) through benchmarks or psychology questionnaires is to provide many different queries from similar minimal contexts (e.g., multiple-choice questions). However, due to LLMs' highly context-dependent nature, conclusions from such minimal-context evaluations may say little about a model's behavior in deployment (where it will be exposed to many new contexts). We argue that context-dependence should be studied as another dimension of LLM comparison alongside others such as cognitive abilities, knowledge, or model size. In this paper, we present a case study on the stability of value expression over different contexts (simulated conversations on different topics), as measured using a standard psychology questionnaire (PVQ) and behavioral downstream tasks. We consider 21 LLMs from six families. Reusing methods from psychology, we study Rank-Order stability on the population (interpersonal) level, and Ipsative stability on the individual (intrapersonal) level. We explore two settings: with and without instructing LLMs to simulate particular personalities. We observe similar trends in the stability of models and model families (the Mixtral, Mistral, GPT-3.5, and Qwen families being more stable than LLaMa-2 and Phi) over those two settings, two different simulated populations, and even on three downstream behavioral tasks. When instructed to simulate particular personas, LLMs exhibit low Rank-Order stability, and this stability further diminishes with conversation length. This highlights the need for future research on LLMs that can coherently simulate a diversity of personas, as well as on how context-dependence can be studied in more thorough and efficient ways. This paper provides a foundational step in that direction, and, to our knowledge, it is the first study of value stability in LLMs. The project website with code is available at https://sites.google.com/view/llmvaluestability.
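To make the two stability notions concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes value scores are stored as arrays of shape (personas, values), one array per context, and it assumes Pearson correlation as the similarity measure; the function names `rank_order_stability` and `ipsative_stability` are hypothetical. Rank-Order stability correlates, for each value, the scores of the whole population across two contexts; Ipsative stability correlates, for each individual, that individual's value profile across two contexts.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def rank_order_stability(ctx_a, ctx_b):
    """Population-level stability between two contexts (illustrative).

    ctx_a, ctx_b: arrays of shape (n_personas, n_values).
    For each value, correlate the personas' scores across the two
    contexts, then average over values.
    """
    n_values = ctx_a.shape[1]
    return float(np.mean(
        [pearson(ctx_a[:, v], ctx_b[:, v]) for v in range(n_values)]
    ))


def ipsative_stability(ctx_a, ctx_b):
    """Individual-level stability between two contexts (illustrative).

    For each persona, correlate its value profile across the two
    contexts, then average over personas.
    """
    n_personas = ctx_a.shape[0]
    return float(np.mean(
        [pearson(ctx_a[p, :], ctx_b[p, :]) for p in range(n_personas)]
    ))


# Toy data (made up for illustration): 3 personas x 3 values.
context_a = np.array([[1.0, 2.0, 3.0],
                      [2.0, 4.0, 1.0],
                      [3.0, 1.0, 5.0]])
# A uniform shift preserves every ordering, so both measures are maximal.
context_b = context_a + 0.5

print(rank_order_stability(context_a, context_b))
print(ipsative_stability(context_a, context_b))
```

With more than two contexts, one plausible design is to average these quantities over all pairs of contexts; whether the paper aggregates this way is not stated in the abstract.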

LLM Stability: A detailed analysis with some surprises

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs

Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores

Stick to your role! Stability of personal values expressed in large language models

Consistency Matters: Explore LLMs Consistency From a Black-Box Perspective

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Dissociation of Faithful and Unfaithful Reasoning in LLMs

Understanding and Mitigating Language Confusion in LLMs

A Survey on LLM-as-a-Judge

Evaluating the Consistency of LLM Evaluators

Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

Large Language Models are Inconsistent and Biased Evaluators

Dissecting Human and LLM Preferences

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

Benchmarking LLMs via Uncertainty Quantification

Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks