Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context

Jingru Jia, Zehua Yuan, Junhao Pan, Paul E. McNamara, Deming Chen
2024-11-01
Abstract: When making decisions under uncertainty, individuals often deviate from rational behavior, which can be evaluated across three dimensions: risk preference, probability weighting, and loss aversion. Given the widespread use of large language models (LLMs) in decision-making processes, it is crucial to assess whether their behavior aligns with human norms and ethical expectations or exhibits potential biases. Several empirical studies have investigated the rationality and social behavior performance of LLMs, yet their internal decision-making tendencies and capabilities remain inadequately understood. This paper proposes a framework, grounded in behavioral economics, to evaluate the decision-making behaviors of LLMs. Through a multiple-choice-list experiment, we estimate the degree of risk preference, probability weighting, and loss aversion in a context-free setting for three commercial LLMs: ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-pro. Our results reveal that LLMs generally exhibit patterns similar to humans, such as risk aversion and loss aversion, with a tendency to overweight small probabilities. However, there are significant variations in the degree to which these behaviors are expressed across different LLMs. We also explore their behavior when embedded with socio-demographic features, uncovering significant disparities. For instance, when modeled with attributes of sexual minority groups or physical disabilities, Claude-3-Opus displays increased risk aversion, leading to more conservative choices. These findings underscore the need for careful consideration of the ethical implications and potential biases in deploying LLMs in decision-making scenarios. Therefore, this study advocates for developing standards and guidelines to ensure that LLMs operate within ethical boundaries while enhancing their utility in complex decision-making environments.
Artificial Intelligence, Computers and Society, Human-Computer Interaction, Machine Learning, Theoretical Economics
What problem does this paper attempt to address?
This paper evaluates whether the decision-making behavior of large language models (LLMs) under uncertainty is consistent with human behavior patterns, and whether these models exhibit potential biases. Specifically, the research focuses on the following points:

1. **Risk Preference**: Evaluate the choice tendency of LLMs when facing risk, that is, whether they tend to take risks or avoid them.
2. **Probability Weighting**: Examine how much weight LLMs attach to events with different probabilities, especially how they handle small-probability events.
3. **Loss Aversion**: Analyze how LLMs react to potential losses, that is, whether a loss affects them more than an equivalent gain.

To evaluate these issues systematically, the paper proposes a framework based on behavioral economics theory. Through multiple-choice-list experiments, it estimates the degree of risk preference, probability weighting, and loss aversion of LLMs. The research also explores how these characteristics change when LLMs are endowed with sociodemographic characteristics, revealing behavioral differences between models when they are assigned the attributes of specific groups (such as sexual minority groups or people with physical disabilities).

### Main Contributions

1. **Develop a Comprehensive Evaluation Framework**: Based on behavioral economics theory, especially the Tanaka, Camerer, and Nguyen (TCN) model, the paper proposes a framework for evaluating the decision-making behavior patterns of LLMs. Three groups of experiments evaluate the risk preference, probability weighting, and loss aversion of LLMs.
2. **Apply the Framework to Evaluate Commercial LLMs**: Three state-of-the-art commercial LLMs (ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-pro) are evaluated. These models generally exhibit human-like behavior patterns, but the degree to which they express these behaviors differs significantly between models.
3. **Further Experiments with Embedded Sociodemographic Characteristics**: By embedding sociodemographic characteristics, the paper evaluates how these characteristics affect the decision-making behaviors of LLMs. Different models exhibit different behavior patterns for specific groups, which emphasizes the need to consider ethical impacts and potential biases when deploying LLMs.

### Research Methods

1. **Experimental Design**: Design multiple-choice-list experiments. Each experiment contains a series of multiple-choice questions that require participants to choose between options with different probabilistic outcomes. The preferences and behavior patterns of participants can be inferred from these choices.
2. **Record Switching Points**: By comparing the utilities of the two options, determine the points at which participants switch from one option to the other, called "switching points".
3. **Set Inequalities**: Evaluate the utility functions at each switching point and establish inequalities for estimating the parameters.
4. **Parameter Estimation**: Estimate the parameters through an iterative process that gradually narrows the parameter intervals until final estimates are obtained.
5. **Behavior Evaluation**: Use the estimated parameters to evaluate the decision-making behaviors of LLMs, in both the context-free setting and the setting with embedded sociodemographic characteristics.

### Experimental Results

1. **Results in the Context-Free Setting**:
   - All three models show a tendency to avoid risk, with an average σ value greater than 0.
   - The models differ in risk preference, probability weighting, and loss aversion. For example, ChatGPT shows a higher degree of risk avoidance but less attention to potential losses; Claude shows a lower degree of risk avoidance and a higher degree of loss aversion; Gemini maintains a balance between risk and caution.
2. **Results with Embedded Sociodemographic Characteristics**:
   - After sociodemographic characteristics are embedded, the decision-making behaviors of LLMs change significantly. For example, ChatGPT becomes more adventurous, while Gemini becomes more conservative.
   - Factors such as age, gender, education level, marital status, and residential area affect the decision-making behaviors of LLMs in different ways. For example, for young users, Claude is more likely to overestimate small-probability events, while Gemini shows a higher degree of loss aversion.

### Conclusion

This research emphasizes the need to carefully consider ethical impacts and potential biases when deploying LLMs for decision support, and advocates the formulation of standards.
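
The research methods summarized above (choice lists, switching points, and inequalities on utilities) can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes a prospect-theory form in the spirit of the TCN model: a power value function with exponent 1 − σ (chosen so that σ > 0 corresponds to risk aversion, matching the sign convention in the results above), a loss-aversion factor λ, and a one-parameter Prelec probability weighting function with parameter α. The function names, the lottery encoding `(x, p, y)`, and the payoffs in `ROWS` are hypothetical and chosen only for illustration.

```python
import math

def value(x, sigma, lam):
    """Power value function (assumed form); requires sigma < 1.
    Losses are scaled by the loss-aversion factor lam."""
    if x >= 0:
        return x ** (1.0 - sigma)
    return -lam * (-x) ** (1.0 - sigma)

def weight(p, alpha):
    """Prelec one-parameter weighting (assumed form).
    alpha = 1 gives w(p) = p; alpha < 1 overweights small probabilities."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return math.exp(-((-math.log(p)) ** alpha))

def lottery_utility(lottery, sigma, alpha, lam):
    """Utility of a lottery (x, p, y): pays x with probability p, else y."""
    x, p, y = lottery
    if x * y >= 0:
        # Same-sign outcomes: weight the outcome with the larger magnitude.
        if abs(x) >= abs(y):
            big, small, p_big = x, y, p
        else:
            big, small, p_big = y, x, 1.0 - p
        v_small = value(small, sigma, lam)
        return v_small + weight(p_big, alpha) * (value(big, sigma, lam) - v_small)
    # Mixed outcomes: weight the gain and the loss separately.
    return (weight(p, alpha) * value(x, sigma, lam)
            + weight(1.0 - p, alpha) * value(y, sigma, lam))

def switching_point(rows, sigma, alpha, lam):
    """1-based index of the first row where option B is preferred, else None."""
    for i, (a, b) in enumerate(rows, start=1):
        if lottery_utility(b, sigma, alpha, lam) > lottery_utility(a, sigma, alpha, lam):
            return i
    return None

# Hypothetical choice list: option A is fixed, option B's high payoff grows.
OPTION_A = (40, 0.3, 10)
ROWS = [(OPTION_A, (payoff, 0.1, 5)) for payoff in (60, 80, 100, 150, 200, 300, 500)]

if __name__ == "__main__":
    print("risk-neutral switch:", switching_point(ROWS, 0.0, 1.0, 1.0))
    print("risk-averse switch: ", switching_point(ROWS, 0.5, 1.0, 1.0))
```

Under these assumptions, a risk-neutral agent (σ = 0, α = 1) switches to option B at row 4, while a risk-averse agent (σ = 0.5) switches only at row 7. Estimation runs this logic in reverse: an observed switching point implies that option A was preferred in the previous row and option B in the current one, which yields a pair of inequalities that bound the parameters.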