A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review

Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V Stolyar, Katelyn Polanska, Karleigh R McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, Yanshan Wang
2024-09-24
Abstract: With generative artificial intelligence (AI), particularly large language models (LLMs), continuing to make inroads in healthcare, it is critical to supplement traditional automated evaluations with human evaluations. Understanding and evaluating the output of LLMs is essential to assuring safety, reliability, and effectiveness. However, human evaluation's cumbersome, time-consuming, and non-standardized nature presents significant obstacles to comprehensive evaluation and widespread adoption of LLMs in practice. This study reviews existing literature on human evaluation methodologies for LLMs in healthcare. We highlight a notable need for a standardized and consistent human evaluation approach. Our extensive literature search, adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, includes publications from January 2018 to February 2024. The review examines the human evaluation of LLMs across various medical specialties, addressing factors such as evaluation dimensions, sample types and sizes, selection and recruitment of evaluators, frameworks and metrics, evaluation process, and statistical analysis type. Drawing on the diverse evaluation strategies employed in these studies, we propose a comprehensive and practical framework for human evaluation of LLMs: QUEST: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence. This framework aims to improve the reliability, generalizability, and applicability of human evaluation of LLMs in different healthcare applications by defining clear evaluation dimensions and offering detailed guidelines.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses a gap in how large language models (LLMs) are evaluated in healthcare. As the use of LLMs in healthcare grows, existing evaluation methods rely mainly on automated metrics, which cannot comprehensively assess the actual utility, accuracy, and safety of LLMs in clinical settings. The paper therefore identifies and analyzes current research on the human evaluation of LLMs in healthcare through a systematic literature review, and proposes a standardized human evaluation framework, QUEST (Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, Trust and Confidence), to improve the reliability, generalizability, and applicability of human evaluation of LLMs across healthcare applications. Specifically, the paper aims to:
1. Identify and analyze studies that report human evaluation of LLMs.
2. Explore the human evaluation methods used to assess LLMs and how they vary across complex medical contexts.
3. Synthesize insights from the literature to propose best practices for designing and implementing rigorous human evaluations.
4. Provide practical guidelines for developing standardized evaluation frameworks.
Through this research, the authors hope to fill the gap in the existing literature on human evaluation of LLMs and lay the foundation for future research, particularly at the intersection of generative AI and medicine.
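To make the five QUEST dimensions concrete, the following is a minimal sketch of how ratings from several human evaluators might be recorded and averaged per dimension. Only the dimension names come from the abstract. The `Rating` record, the `mean_score_per_dimension` helper, and the 1-to-5 Likert scale are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean

# The five QUEST dimensions, as named in the abstract.
QUEST_DIMENSIONS = (
    "Quality of Information",
    "Understanding and Reasoning",
    "Expression Style and Persona",
    "Safety and Harm",
    "Trust and Confidence",
)


@dataclass(frozen=True)
class Rating:
    """One evaluator's score for one dimension (hypothetical record type)."""
    evaluator_id: str
    dimension: str
    score: int  # ASSUMPTION: 1-5 Likert scale; the abstract does not fix a scale


def mean_score_per_dimension(ratings):
    """Average the scores for each QUEST dimension that has at least one rating."""
    scores_by_dimension = {dimension: [] for dimension in QUEST_DIMENSIONS}
    for rating in ratings:
        if rating.dimension not in scores_by_dimension:
            raise ValueError(f"Unknown QUEST dimension: {rating.dimension!r}")
        if not 1 <= rating.score <= 5:
            raise ValueError(f"Score out of assumed 1-5 range: {rating.score}")
        scores_by_dimension[rating.dimension].append(rating.score)
    # Dimensions without ratings are omitted rather than reported as zero.
    return {
        dimension: mean(scores)
        for dimension, scores in scores_by_dimension.items()
        if scores
    }
```

A study following this sketch would collect one `Rating` per evaluator, dimension, and LLM output, then report the per-dimension means alongside whatever agreement statistics its design calls for.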