Abstract: This study explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance objectivity in organizational task performance evaluations. Through comparative analyses across two studies, including various task performance outputs, we demonstrate that LLMs can serve as a reliable and even superior alternative to human raters in evaluating knowledge-based performance outputs, which are a key contribution of knowledge workers. Our results suggest that GPT ratings are comparable to human ratings but exhibit higher consistency and reliability. Additionally, multiple GPT ratings combined for the same performance output show strong correlations with aggregated human performance ratings, akin to the consensus principle observed in the performance evaluation literature. However, we also find that LLMs are prone to contextual biases, such as the halo effect, mirroring human evaluative biases. Our research suggests that while LLMs are capable of extracting meaningful constructs from text-based data, their scope is currently limited to specific forms of performance evaluation. By highlighting both the potential and limitations of LLMs, our study contributes to the discourse on AI's role in management studies and sets a foundation for future research to refine AI's theoretical and practical applications in management.

What problem does this paper attempt to address?

This paper attempts to address the issue of how to improve the objectivity and reliability of task performance evaluations in organizational management. Specifically, the researchers explore the potential application of large language models (LLMs), particularly GPT-4, in performance evaluations to reduce the subjectivity and biases present in traditional human evaluation methods.

### Main Issues:

1. **Improving the Objectivity of Performance Evaluations**: Traditional performance evaluations rely on the assessments of human observers (such as leaders and colleagues), which are often influenced by subjectivity and personal biases. The researchers aim to provide more objective and consistent evaluations through the use of LLMs.
2. **Reducing Bias in Evaluations**: Human evaluators are susceptible to various cognitive biases, such as the halo effect, leniency or strictness bias, etc. The researchers seek to understand whether LLMs exhibit similar biases and how these biases affect evaluation results.
3. **Enhancing Consistency and Reliability of Evaluations**: By comparing the evaluation results of LLMs with those of human evaluators, the researchers aim to verify the consistency of LLMs' evaluations across different tasks and complexities, thereby determining their feasibility as an alternative evaluation tool.

### Research Methods:

- **Experimental Design**: The researchers conducted two experiments, evaluating 520 and 224 task outputs respectively.
- **Data Source**: Each task was evaluated by six different evaluators to obtain multiple perspectives.
- **Evaluation Comparison**: The evaluation results generated by GPT-4 were compared with the aggregated evaluations of human evaluators to analyze their consistency, reliability, and potential biases.

### Research Findings:

- **High Consistency**: The evaluation results generated by GPT-4 were highly consistent with the aggregated evaluations of human evaluators, demonstrating good reliability and accuracy.
- **Reduced Individual Bias**: Compared to individual evaluators, GPT-4's evaluation results were more consistent, reducing errors caused by individual differences.
- **Presence of Certain Biases**: Although GPT-4 outperformed human evaluators in many aspects, the study found that it might still be influenced by certain cognitive biases, such as the halo effect.

### Conclusion:

This study demonstrates the significant potential of LLMs in performance evaluations, particularly in improving the objectivity and consistency of evaluations. However, the study also highlights some limitations of LLMs, such as potential cognitive biases, providing directions for future research. Overall, this study provides an important theoretical and practical foundation for the application of AI technology in management research.
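The evaluation comparison described above (aggregating repeated ratings per task output, then correlating GPT and human aggregates) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the ratings are made up, the 1-5 scale and the three GPT ratings per output are arbitrary choices, the helper names `aggregate` and `pearson` are hypothetical, and Pearson correlation on mean ratings is one plausible way to measure agreement, not necessarily the statistic the authors used. Only the six human evaluators per output comes from the summary.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / (var_x * var_y) ** 0.5


def aggregate(ratings_per_output):
    """Average the repeated ratings given to each task output."""
    return [mean(ratings) for ratings in ratings_per_output]


# Made-up ratings on an assumed 1-5 scale; one inner list per task output.
# Six human evaluators per output (as in the summary), three GPT runs (assumed).
human_ratings = [
    [4, 5, 4, 3, 4, 5],
    [2, 3, 2, 2, 1, 3],
    [3, 3, 4, 3, 2, 3],
    [5, 4, 5, 5, 4, 5],
]
gpt_ratings = [
    [4, 4, 4],
    [2, 2, 3],
    [3, 3, 3],
    [5, 5, 4],
]

human_mean = aggregate(human_ratings)
gpt_mean = aggregate(gpt_ratings)
r = pearson(gpt_mean, human_mean)
print(f"correlation between aggregated GPT and human ratings: {r:.3f}")
```

Averaging before correlating reflects the consensus idea mentioned in the abstract: individual ratings are noisy, so agreement is measured between the pooled judgments rather than between single raters.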

From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management
