Standardizing and Scaffolding Healthcare AI-Chatbot Evaluation

Yining Hua, Winna Xia, David W. Bates, George Luke Hartstein, Hyungjin Tom Kim, Michael Lingzhi Li, Benjamin W Nelson, Charles Stromeyer IV, Darlene King, Jina Suh, Li Zhou, John Torous
DOI: https://doi.org/10.1101/2024.07.21.24310774
2024-09-03
Abstract: The rapid rise of healthcare chatbots, valued at $787.1 million in 2022 and projected to grow at 23.9% annually through 2030, underscores the need for robust evaluation frameworks. Despite their potential, the absence of standardized evaluation criteria and rapid AI advancements complicate assessments. This study addresses these challenges by developing the first comprehensive evaluation framework inspired by health app regulations and integrating insights from diverse stakeholders. Following PRISMA guidelines, we reviewed 11 existing frameworks, refining 271 questions into a structured framework encompassing three priority constructs, 18 second-level constructs, and 60 third-level constructs. Our framework emphasizes safety, privacy, trustworthiness, and usefulness, aligning with recent concerns about AI in healthcare. This adaptable framework aims to serve as the initial step in facilitating the responsible integration of chatbots into healthcare settings.
What problem does this paper attempt to address?
The paper addresses the lack of a standardized framework for evaluating healthcare chatbots. These chatbots have considerable potential in healthcare, but they are developing faster than existing evaluation methods can keep up with, so evaluation standards are inconsistent. This makes it hard to compare chatbots and may also put users' privacy, security, and trust at risk. The paper therefore aims to develop a comprehensive, adaptable evaluation framework to guide the responsible evaluation and implementation of healthcare chatbots. Specifically, it approaches the problem in four steps:

1. **Literature Review and Framework Integration**: Following the PRISMA guidelines, the authors systematically reviewed 11 existing evaluation frameworks and extracted 356 questions from them. After screening and integration, the final framework contains 271 questions.
2. **Multi-level Structure Design**: The framework is organized as a multi-level tree with three priority constructs (Safety, Privacy, and Fairness; Trustworthiness and Usefulness; Design and Operational Effectiveness), 18 second-level constructs, and 60 third-level constructs. This structure keeps the evaluation both comprehensive and fine-grained.
3. **Multi-stakeholder Participation**: While designing the framework, the authors solicited input from developers, clinicians, patients, and policymakers to ensure that it is practical and applicable.
4. **Flexibility and Adaptability**: The framework accounts for the needs and usage scenarios of different users and can be adjusted and customized for different evaluation purposes.

Through these efforts, the paper aims to provide a standardized evaluation tool that helps healthcare practitioners, researchers, and policymakers better understand and evaluate chatbot performance, thereby promoting the safe and effective use of chatbots in healthcare.
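
The multi-level tree structure described above could be represented roughly as follows. This is a minimal sketch, not the authors' implementation: the `Construct` class and the `count_by_level` and `count_questions` helpers are names assumed here for illustration. Only the three priority construct names come from the summary; the second-level construct, third-level construct, and question entries are hypothetical placeholders, since this summary does not list them.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Construct:
    """One node of the evaluation tree (hypothetical class for this sketch)."""
    name: str
    children: List["Construct"] = field(default_factory=list)
    questions: List[str] = field(default_factory=list)

    def add(self, child: "Construct") -> "Construct":
        """Attach a sub-construct and return it so calls can be chained."""
        self.children.append(child)
        return child


def count_by_level(root: Construct) -> Dict[int, int]:
    """Count constructs at each depth below the root (the root is depth 0)."""
    counts: Dict[int, int] = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if depth > 0:
            counts[depth] = counts.get(depth, 0) + 1
        stack.extend((child, depth + 1) for child in node.children)
    return counts


def count_questions(root: Construct) -> int:
    """Total number of questions attached anywhere in the tree."""
    return len(root.questions) + sum(count_questions(c) for c in root.children)


framework = Construct("Healthcare chatbot evaluation framework")

# The three priority constructs named in the summary.
for priority_name in (
    "Safety, Privacy, and Fairness",
    "Trustworthiness and Usefulness",
    "Design and Operational Effectiveness",
):
    framework.add(Construct(priority_name))

# Placeholders only: the real second- and third-level constructs and
# questions are defined in the paper, not in this summary.
second = framework.children[0].add(Construct("<second-level construct>"))
third = second.add(Construct("<third-level construct>"))
third.questions.append("<evaluation question>")

level_counts = count_by_level(framework)
total_questions = count_questions(framework)
```

Populated with the paper's full content, `count_by_level` would be expected to report 3, 18, and 60 constructs at levels 1 to 3, and `count_questions` would report 271. Keeping questions on the nodes also makes it straightforward to evaluate against a single subtree when only some constructs are relevant to a given evaluation purpose.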