Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models

Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy
2024-06-05
Abstract: Much recent work seeks to evaluate values and opinions in large language models (LLMs) using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask LLMs survey questions. Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations. As a case study, we focus on the popular Political Compass Test (PCT). In a systematic review, we find that most prior work using the PCT forces models to comply with the PCT's multiple-choice format. We show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. Then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how values and opinions in large language models (LLMs) can be evaluated meaningfully. Most current evaluations rely on multiple-choice surveys and questionnaires, which does not match how people use LLMs in practice: real users do not typically ask LLMs survey questions. The paper argues that this constrained format is artificial and can produce evaluation results that are unstable and do not generalize. The authors therefore challenge the prevailing constrained evaluation paradigm and explore more realistic unconstrained evaluations, using the widely used Political Compass Test (PCT) as a case study.

Specifically, the paper examines the problem from four angles:

1. **Systematic Review**: The authors systematically review 12 prior works that use the PCT to evaluate LLMs and find that most of them force the model to comply with the PCT's multiple-choice format.
2. **Model Behavior under Different Constraints**: The authors show that models give substantively different answers when they are not forced to choose an option, and that the answers change again depending on how the model is forced.
3. **Paraphrase Robustness**: Using minimal, meaning-preserving paraphrases of the prompt, the authors find that even slight changes in wording lead to substantially different PCT results.
4. **Open-ended Response Setting**: The authors evaluate models in a more realistic open-ended answer setting and find that open-ended responses differ substantially from multiple-choice responses.

In summary, the paper's core question is how to evaluate values and opinions in LLMs more accurately in settings closer to real-world use, and it distills its findings into recommendations and open challenges for future evaluations.
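
The contrast between unforced and forced multiple-choice prompting, and the check for paraphrase robustness, can be illustrated with a minimal Python sketch. This is not the authors' code: the question templates, the forcing suffix, the `classify` string matcher, and the `model` callable are all hypothetical stand-ins chosen for illustration. The four answer options are the PCT's standard ones.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# The PCT's four answer options (there is no neutral option).
PCT_OPTIONS = ["Strongly disagree", "Disagree", "Agree", "Strongly agree"]

# Hypothetical paraphrases of the question wording (assumed, not from the paper).
TEMPLATES = [
    "What is your opinion on the following proposition:",
    "What is your view on the following proposition:",
    "State your opinion on the following proposition:",
]


def build_prompt(template: str, proposition: str, forced_suffix: str = "") -> str:
    """Build a multiple-choice prompt; a non-empty suffix pressures the model to pick."""
    options = "\n".join(f"{i}) {opt}" for i, opt in enumerate(PCT_OPTIONS, start=1))
    return f"{template}\n{proposition}\n{options}\n{forced_suffix}".strip()


def classify(response: str) -> str:
    """Crude illustrative labeller: map free text to a PCT option or 'invalid'."""
    text = response.lower()
    # Check longer labels first so "disagree" is not mistaken for "agree".
    for opt in sorted(PCT_OPTIONS, key=len, reverse=True):
        if opt.lower() in text:
            return opt
    return "invalid"


def answer_distribution(model: Callable[[str], str], prompts: Iterable[str]) -> Dict[str, float]:
    """Share of each label across prompts; one dominant label suggests robustness."""
    labels = [classify(model(p)) for p in prompts]
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}
```

Here `model` is any function that takes a prompt string and returns the model's text response. Comparing `answer_distribution` for prompts built with and without a forcing suffix shows how often the model declines to pick an option when unforced, and comparing it across `TEMPLATES` shows how much the answer shifts under paraphrase. A real study would need a far more careful response classifier than simple substring matching.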