Large Language Models in Pathology: A Comparative Study on Multiple Choice Question Performance with Pathology Trainees

Wei Du, Xueting Jin, Jaryse Harris, Alessandro Brunetti, Erika Johnson, Olivia Leung, Xingchen Li, Selemon Walle, Qing Yu, Xiao Zhou, Fang Bian, Kajanna Mckenzie, Manita Kanathanavanich, Yusuf Ozcelik, Farah El-Sharkawy, Shunsuke Koga
DOI: https://doi.org/10.1101/2024.07.10.24310093
2024-10-18
Abstract: Large language models (LLMs), such as ChatGPT and Bard, have shown potential in various medical applications. This study aimed to evaluate the performance of LLMs, specifically ChatGPT and Bard, in pathology by comparing their performance with that of pathology trainees, and to assess the consistency of their responses. We selected 150 multiple-choice questions from 15 subspecialties, excluding those with images. Both ChatGPT and Bard were tested on these questions across three separate sessions between June 2023 and January 2024, and their responses were compared with those of 14 pathology trainees (8 junior and 6 senior) from two hospitals. Questions were categorized as easy, intermediate, or difficult based on trainee performance. Consistency and variability in LLM responses were analyzed across the three evaluation sessions. ChatGPT significantly outperformed Bard and trainees, achieving an average total score of 82.2% compared to Bard's 49.5%, junior trainees' 45.1%, and senior trainees' 58.3%. ChatGPT's performance was notably stronger on difficult questions (61.8%-70.6%) compared to Bard (29.4%-32.4%) and trainees (5.9%-44.1%). For easy questions, ChatGPT (88.9%-94.4%) and trainees (75.0%-100.0%) showed similarly high scores. Consistency analysis revealed that ChatGPT showed a high consistency rate of 80%-85% across three tests, whereas Bard exhibited greater variability with consistency rates of 54%-61%. ChatGPT consistently outperformed Bard and trainees, especially on difficult questions. While LLMs show significant promise in pathology education and practice, continued development and human oversight are crucial for reliable clinical application.
What problem does this paper attempt to address?
This paper aims to evaluate large language models (LLMs), especially ChatGPT and Bard, in the field of pathology, and to compare their performance with that of pathology trainees in order to assess the consistency and accuracy of these models. Specifically, the researchers selected 150 multiple-choice questions from 15 subspecialties, excluding image-based questions, and tested ChatGPT and Bard on them three times. They also compared the results with the answers of 14 pathology trainees (8 junior and 6 senior) from two hospitals. In this way, the researchers hope to understand the potential applications and limitations of LLMs in pathology education and practice.

The main objectives of the study include:

1. **Evaluating the performance of LLMs**: Comparing the performance of ChatGPT and Bard on pathology multiple-choice questions, especially at different difficulty levels.
2. **Comparison with human trainees**: Comparing the performance of LLMs with that of pathology trainees to understand how they differ on questions of varying difficulty.
3. **Evaluating consistency and stability**: Analyzing the consistency and stability of ChatGPT and Bard across multiple tests, as well as how their performance changes with question difficulty.

Through these objectives, the researchers hope to provide valuable insights for pathology education and clinical decision-making, and to explore the potential and challenges of LLMs in future medical applications.
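
The abstract reports consistency rates across three sessions but does not say how they were computed. As a rough illustration only, the sketch below assumes one plausible reading: the fraction of questions on which a model gave the same answer in each pair of sessions. The function name `pairwise_agreement` and the toy answers are hypothetical and are not taken from the study.

```python
from itertools import combinations


def pairwise_agreement(sessions):
    """Return {(i, j): fraction of questions answered identically in sessions i and j}.

    `sessions` is a list of answer lists, one per session, aligned by question.
    This is an assumed definition of consistency, not the paper's stated method.
    """
    n_questions = len(sessions[0])
    if n_questions == 0 or any(len(s) != n_questions for s in sessions):
        raise ValueError("all sessions must cover the same non-empty question set")

    rates = {}
    for (i, a), (j, b) in combinations(enumerate(sessions), 2):
        same = sum(x == y for x, y in zip(a, b))
        rates[(i, j)] = same / n_questions
    return rates


# Hypothetical answers to five questions in three sessions (not study data).
toy_sessions = [
    ["A", "C", "B", "D", "A"],
    ["A", "C", "B", "B", "A"],
    ["A", "D", "B", "D", "A"],
]
rates = pairwise_agreement(toy_sessions)
for pair, rate in sorted(rates.items()):
    print(f"sessions {pair}: {rate:.0%} identical answers")
```

Under this reading, three sessions yield three pairwise rates, which would naturally be reported as a range. Other definitions, such as the share of questions answered identically in all three sessions, are equally possible.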