Abstract: When making decisions under uncertainty, individuals often deviate from rational behavior, which can be evaluated across three dimensions: risk preference, probability weighting, and loss aversion. Given the widespread use of large language models (LLMs) in decision-making processes, it is crucial to assess whether their behavior aligns with human norms and ethical expectations or exhibits potential biases. Several empirical studies have investigated the rationality and social behavior performance of LLMs, yet their internal decision-making tendencies and capabilities remain inadequately understood. This paper proposes a framework, grounded in behavioral economics, to evaluate the decision-making behaviors of LLMs. Through a multiple-choice-list experiment, we estimate the degree of risk preference, probability weighting, and loss aversion in a context-free setting for three commercial LLMs: ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-pro. Our results reveal that LLMs generally exhibit patterns similar to humans, such as risk aversion and loss aversion, with a tendency to overweight small probabilities. However, there are significant variations in the degree to which these behaviors are expressed across different LLMs. We also explore their behavior when embedded with socio-demographic features, uncovering significant disparities. For instance, when modeled with attributes of sexual minority groups or physical disabilities, Claude-3-Opus displays increased risk aversion, leading to more conservative choices. These findings underscore the need for careful consideration of the ethical implications and potential biases in deploying LLMs in decision-making scenarios. Therefore, this study advocates for developing standards and guidelines to ensure that LLMs operate within ethical boundaries while enhancing their utility in complex decision-making environments.

What problem does this paper attempt to address?

This paper evaluates whether the decision-making behavior of large language models (LLMs) under uncertainty is consistent with human behavior patterns, and whether these models carry potential biases. Specifically, the research focuses on three points:

1. **Risk Preference**: Evaluate how LLMs choose when facing risk, that is, whether they tend to take risks or avoid them.
2. **Probability Weighting**: Examine how much weight LLMs attach to events with different probabilities, especially how they handle small-probability events.
3. **Loss Aversion**: Analyze how LLMs react to potential losses, that is, whether a loss affects them more than an equivalent gain.

To evaluate these issues systematically, the paper proposes a framework based on behavioral economics theory. Through multiple-choice-list experiments, it estimates the degree of risk preference, probability weighting, and loss aversion of LLMs. The research also explores how these characteristics change when LLMs are endowed with sociodemographic characteristics, revealing behavioral differences between models when they are given the attributes of specific groups (such as sexual minority groups or people with physical disabilities).

### Main Contributions

1. **Develop a Comprehensive Evaluation Framework**: Based on behavioral economics theory, especially the Tanaka, Camerer, and Nguyen (TCN) model, the paper proposes a framework for evaluating the decision-making behavior patterns of LLMs. Three groups of experiments evaluate the risk preference, probability weighting, and loss aversion of LLMs.
2. **Apply the Framework to Evaluate Commercial LLMs**: Three state-of-the-art commercial LLMs (ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-pro) are evaluated. These models generally exhibit human-like behavior patterns, but the degree to which they express these behaviors differs significantly between models.
3. **Further Experiments with Embedded Sociodemographic Characteristics**: By embedding sociodemographic characteristics, the paper evaluates how these characteristics affect the decision-making behaviors of LLMs. Different models exhibit different behavior patterns for specific groups, which emphasizes the need to consider ethical impacts and potential biases when deploying LLMs.

### Research Methods

1. **Experimental Design**: Design multiple-choice-list experiments. Each experiment contains a series of multiple-choice questions that require participants to choose between options with different probabilistic outcomes. The preferences and behavior patterns of participants can be inferred from these choices.
2. **Record Switching Points**: By comparing utility functions, determine the points at which participants switch from one option to the other, called "switching points".
3. **Set Inequalities**: Evaluate the utility functions at each switching point and establish inequalities for estimating the parameters.
4. **Parameter Estimation**: Estimate the parameters through an iterative process that gradually narrows the parameter intervals until the estimated values are obtained.
5. **Behavior Evaluation**: Use the estimated parameters to evaluate the decision-making behaviors of LLMs, both in the context-free setting and in the setting with embedded sociodemographic characteristics.

### Experimental Results

1. **Results in the Context-Free Situation**:
   - All three models show a tendency to avoid risks, with an average σ value greater than 0.
   - The models differ in risk preference, probability weighting, and loss aversion. For example, ChatGPT shows a higher degree of risk avoidance but pays less attention to potential losses; Claude shows a lower degree of risk avoidance and a higher degree of loss aversion; Gemini maintains a balance between risk and caution.
2. **Results with Embedded Sociodemographic Characteristics**:
   - After sociodemographic characteristics are embedded, the decision-making behaviors of LLMs change significantly. For example, ChatGPT becomes more adventurous, while Gemini becomes more conservative.
   - Factors such as age, gender, education level, marital status, and residential area affect the decision-making behaviors of LLMs differently. For example, among young users, Claude is more likely to overestimate small-probability events, while Gemini shows a higher degree of loss aversion.

### Conclusion

This research emphasizes the need to carefully consider ethical impacts and potential biases when deploying LLMs for decision support, and advocates the formulation of standards.
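
The switching-point procedure outlined under Research Methods can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's actual setup: the payoffs in the choice list are hypothetical, the value function is taken to be `x**(1 - sigma)` (chosen so that σ > 0 corresponds to risk aversion, as in the results above), the probability weighting is assumed to follow the one-parameter Prelec form, and a simple grid search stands in for the iterative interval narrowing. The function names (`value`, `weight`, `switching_row`, `consistent_params`) are likewise invented for this example.

```python
import math

# Illustrative choice list (hypothetical payoffs, not taken from the paper).
# Option A pays 40 with probability 0.3, otherwise 10.
# Option B pays B_HIGH[row] with probability 0.1, otherwise 5.
A = (40.0, 0.3, 10.0)  # (high payoff, probability of high payoff, low payoff)
B_HIGH = [68, 75, 83, 93, 106, 125, 150, 185, 220, 300, 400, 600, 1000, 1700]
B_PROB, B_LOW = 0.1, 5.0


def value(x, sigma, lam=1.0):
    """Power value function (assumed form): x**(1 - sigma), losses scaled by lam."""
    if x >= 0:
        return x ** (1.0 - sigma)
    return -lam * (-x) ** (1.0 - sigma)


def weight(p, alpha):
    """Prelec weighting (assumed form); alpha < 1 overweights small probabilities."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return math.exp(-((-math.log(p)) ** alpha))


def lottery_utility(high, p_high, low, sigma, alpha):
    """Utility of a two-outcome gain lottery: v(low) + w(p_high) * (v(high) - v(low))."""
    v_low = value(low, sigma)
    return v_low + weight(p_high, alpha) * (value(high, sigma) - v_low)


def switching_row(sigma, alpha):
    """First row where B is strictly preferred; len(B_HIGH) means never switching."""
    u_a = lottery_utility(*A, sigma, alpha)
    for row, high in enumerate(B_HIGH):
        if lottery_utility(high, B_PROB, B_LOW, sigma, alpha) > u_a:
            return row
    return len(B_HIGH)


def consistent_params(observed_row):
    """Grid points (sigma, alpha) that satisfy the switching-point inequalities."""
    fits = []
    for i in range(-10, 20):        # sigma from -0.5 to 0.95 in steps of 0.05
        for j in range(4, 29):      # alpha from 0.2 to 1.4 in steps of 0.05
            sigma, alpha = i / 20, j / 20
            u_a = lottery_utility(*A, sigma, alpha)
            ok = True
            if observed_row > 0:
                # A is still (weakly) preferred in the row just before the switch.
                u_b_prev = lottery_utility(B_HIGH[observed_row - 1], B_PROB, B_LOW, sigma, alpha)
                ok = ok and u_a >= u_b_prev
            if observed_row < len(B_HIGH):
                # B is strictly preferred in the row where the switch happens.
                u_b_here = lottery_utility(B_HIGH[observed_row], B_PROB, B_LOW, sigma, alpha)
                ok = ok and u_b_here > u_a
            if ok:
                fits.append((sigma, alpha))
    return fits


def interval(fits, index):
    """Smallest and largest value of one parameter among the consistent grid points."""
    values = [pair[index] for pair in fits]
    return min(values), max(values)


if __name__ == "__main__":
    row = switching_row(0.3, 0.7)  # simulate a respondent with known parameters
    fits = consistent_params(row)
    print("switching row:", row)
    print("sigma interval:", interval(fits, 0))
    print("alpha interval:", interval(fits, 1))
```

A single choice list only bounds a band of (σ, α) pairs rather than a point estimate, so the intervals printed here are wide. Intersecting the consistent sets from several lists would narrow them, and a list involving losses would be needed to bound the loss-aversion parameter `lam`, which this sketch defines but does not estimate.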

Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina

On the Decision-Making Abilities in Role-Playing using Large Language Models

Exploring the psychology of LLMs' Moral and Legal Reasoning

Cognitive Bias in Decision-Making with LLMs

Large Language Models Assume People are More Rational than We Really are

Alignment Between the Decision-Making Logic of LLMs and Human Cognition: A Case Study on Legal LLMs

Determinants of LLM-assisted Decision-Making

LLM economicus? Mapping the Behavioral Biases of LLMs via Utility Theory

Learning to be Homo Economicus: Can an LLM Learn Preferences from Choice

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

Exploring Critical Testing Scenarios for Decision-Making Policies: An LLM Approach

Quantifying Risk Propensities of Large Language Models: Ethical Focus and Bias Detection through Role-Play

Language Models Trained to do Arithmetic Predict Human Risky and Intertemporal Choice

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

DeLLMa: Decision Making Under Uncertainty with Large Language Models

How Ethical Should AI Be? How AI Alignment Shapes the Risk Preferences of LLMs

Defining and Evaluating Decision and Composite Risk in Language Models Applied to Natural Language Inference