STEER: Assessing the Economic Rationality of Large Language Models

Narun Raman, Taylor Lundy, Samuel Amouyal, Yoav Levine, Kevin Leyton-Brown, Moshe Tennenholtz
2024-05-29
Abstract: There is increasing interest in using LLMs as decision-making "agents." Doing so includes many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc.? Settling these questions -- and more broadly, determining whether an LLM agent is reliable enough to be trusted -- requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLM's performance on these elements and, combined with a user-provided rubric, produces a "STEER report card." Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.
Computation and Language, General Economics
What problem does this paper attempt to address?
The paper attempts to address the issue of evaluating the economic rationality of large language models (LLMs) when used as decision agents. Specifically, the researchers face the following key questions:

1. **Choosing the right model**: How to select the most suitable LLM to perform specific decision tasks?
2. **Designing effective prompts**: How to optimize the performance of LLMs through prompting, such as whether chain-of-thought reasoning is needed?
3. **Evaluating economic rationality**: How to systematically assess the performance of an LLM in various economic decision tasks to ensure its behavior is reliable and trustworthy?

To answer these questions, the paper proposes a benchmark framework named STEER (Systematic and Tuneable Evaluation of Economic Rationality). The main contributions of the STEER framework include:

- **Classification of elements of economic rationality**: The paper first provides a detailed classification of economic rationality, defining 64 specific "rationality elements" that cover various aspects from basic mathematical abilities to decision-making in complex multi-agent environments.
- **Generation and validation of test questions**: Based on the above classification, the researchers generated a large number of multiple-choice questions and ensured the quality and accuracy of these questions through manual validation.
- **STEER report card**: Using the STEER framework, detailed report cards can be generated to evaluate the performance of different LLMs in various decision tasks, including the impact of factors such as model size, self-explanation, and few-shot prompting on performance.

Through this systematic evaluation method, the researchers hope to provide a reliable assessment standard for the application of LLMs in the field of economic decision-making, thereby promoting further development in this area.
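
The report-card idea described above (per-element scores on multiple-choice questions, combined with a user-provided rubric) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the element names, and the choice of a weighted average as the aggregation rule are all assumptions made here for illustration.

```python
from typing import Dict, List


def element_accuracy(model_answers: List[str], answer_key: List[str]) -> float:
    """Fraction of multiple-choice questions answered correctly for one element."""
    if len(model_answers) != len(answer_key) or not answer_key:
        raise ValueError("answers and key must be non-empty and the same length")
    correct = sum(a == k for a, k in zip(model_answers, answer_key))
    return correct / len(answer_key)


def report_card_score(accuracies: Dict[str, float], rubric: Dict[str, float]) -> float:
    """Combine per-element accuracies using rubric weights.

    Assumption: the rubric maps element names to non-negative weights and the
    overall score is their weighted average. The paper may aggregate differently.
    """
    total_weight = sum(rubric.values())
    if total_weight <= 0:
        raise ValueError("rubric must have positive total weight")
    missing = [name for name in rubric if name not in accuracies]
    if missing:
        raise KeyError(f"no accuracy recorded for elements: {missing}")
    weighted = sum(weight * accuracies[name] for name, weight in rubric.items())
    return weighted / total_weight


# Hypothetical element names and answers, for illustration only.
accuracies = {
    "arithmetic": element_accuracy(["A", "B", "C", "D"], ["A", "B", "C", "D"]),
    "transitivity": element_accuracy(["A", "B", "C", "D"], ["A", "B", "A", "A"]),
}
rubric = {"arithmetic": 1.0, "transitivity": 3.0}
overall = report_card_score(accuracies, rubric)
```

Under this sketch, a user who cares more about one element simply raises its weight in `rubric`, which changes the overall score without re-running the model on the questions.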