From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management

Ning Li, Huaikang Zhou, Mingze Xu
2024-08-10
Abstract: This study explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance objectivity in organizational task performance evaluations. Through comparative analyses across two studies, including various task performance outputs, we demonstrate that LLMs can serve as a reliable and even superior alternative to human raters in evaluating knowledge-based performance outputs, which are a key contribution of knowledge workers. Our results suggest that GPT ratings are comparable to human ratings but exhibit higher consistency and reliability. Additionally, combined multiple GPT ratings on the same performance output show strong correlations with aggregated human performance ratings, akin to the consensus principle observed in performance evaluation literature. However, we also find that LLMs are prone to contextual biases, such as the halo effect, mirroring human evaluative biases. Our research suggests that while LLMs are capable of extracting meaningful constructs from text-based data, their scope is currently limited to specific forms of performance evaluation. By highlighting both the potential and limitations of LLMs, our study contributes to the discourse on AI's role in management studies and sets a foundation for future research to refine AI's theoretical and practical applications in management.
Computation and Language, Artificial Intelligence, Emerging Technologies, Human-Computer Interaction, General Economics
What problem does this paper attempt to address?
This paper attempts to address the issue of how to improve the objectivity and reliability of task performance evaluations in organizational management. Specifically, the researchers explore the potential application of large language models (LLMs), particularly GPT-4, in performance evaluations to reduce the subjectivity and biases present in traditional human evaluation methods.

### Main Issues:

1. **Improving the Objectivity of Performance Evaluations**: Traditional performance evaluations rely on the assessments of human observers (such as leaders and colleagues), which are often influenced by subjectivity and personal biases. The researchers aim to provide more objective and consistent evaluations through the use of LLMs.
2. **Reducing Bias in Evaluations**: Human evaluators are susceptible to various cognitive biases, such as the halo effect and leniency or strictness bias. The researchers seek to understand whether LLMs exhibit similar biases and how these biases affect evaluation results.
3. **Enhancing Consistency and Reliability of Evaluations**: By comparing the evaluation results of LLMs with those of human evaluators, the researchers aim to verify the consistency of LLMs' evaluations across different tasks and complexities, thereby determining their feasibility as an alternative evaluation tool.

### Research Methods:

- **Experimental Design**: The researchers conducted two experiments, evaluating 520 and 224 task outputs respectively.
- **Data Source**: Each task was evaluated by six different evaluators to obtain multiple perspectives.
- **Evaluation Comparison**: The evaluation results generated by GPT-4 were compared with the aggregated evaluations of human evaluators to analyze their consistency, reliability, and potential biases.

### Research Findings:

- **High Consistency**: The evaluation results generated by GPT-4 were highly consistent with the aggregated evaluations of human evaluators, demonstrating good reliability and accuracy.
- **Reduced Individual Bias**: Compared to individual evaluators, GPT-4's evaluation results were more consistent, reducing errors caused by individual differences.
- **Presence of Certain Biases**: Although GPT-4 outperformed human evaluators in many aspects, the study found that it might still be influenced by certain cognitive biases, such as the halo effect.

### Conclusion:

This study demonstrates the significant potential of LLMs in performance evaluations, particularly in improving the objectivity and consistency of evaluations. However, the study also highlights some limitations of LLMs, such as potential cognitive biases, providing directions for future research. Overall, this study provides an important theoretical and practical foundation for the application of AI technology in management research.
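
### Illustrative Sketch:

The comparison described under Research Methods (aggregating several ratings per output, then checking agreement and consistency) can be sketched roughly as follows. This is a minimal illustration and not the authors' analysis code: the function names, the choice of simple averaging and Pearson correlation, the mean pairwise correlation as a consistency index, and the toy ratings are all assumptions made here for demonstration. The numbers are invented and do not reflect the paper's data or results.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def aggregate(ratings_by_rater):
    """Average the ratings for each output (rows = raters, columns = outputs)."""
    return [mean(col) for col in zip(*ratings_by_rater)]


def mean_pairwise_correlation(ratings_by_rater):
    """Average Pearson correlation over all rater pairs (a rough consistency index)."""
    pairs = combinations(ratings_by_rater, 2)
    return mean(pearson(a, b) for a, b in pairs)


# Hypothetical toy data: 3 raters x 5 outputs on a 1-7 scale (invented values).
human = [
    [3, 5, 2, 6, 4],
    [4, 5, 1, 7, 3],
    [2, 6, 3, 5, 5],
]
# Hypothetical repeated GPT ratings of the same 5 outputs (invented values).
gpt = [
    [3, 5, 2, 6, 4],
    [3, 6, 2, 6, 4],
    [4, 5, 2, 7, 4],
]

# Agreement between aggregated human and aggregated GPT ratings.
agreement = pearson(aggregate(human), aggregate(gpt))

# Consistency within each group of raters.
human_consistency = mean_pairwise_correlation(human)
gpt_consistency = mean_pairwise_correlation(gpt)

print(f"agreement={agreement:.2f}")
print(f"human consistency={human_consistency:.2f}, GPT consistency={gpt_consistency:.2f}")
```

In practice a study of this kind would more likely report an established reliability statistic (for example, an intraclass correlation) than the simple pairwise average used here; the sketch only shows the shape of the comparison.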