All You Need Is Context: Clinician Evaluations of various iterations of a Large Language Model-Based First Aid Decision Support Tool in Ghana

Paulina Boadiwaa Mensah, Nana Serwaa Quao, Sesinam Dagadu, James Kwabena Mensah, Jude Domfeh Darkwah, Project Genie Clinician Evaluation Group [1]
DOI: https://doi.org/10.1101/2024.04.03.24305276
2024-04-25
Abstract: As advancements in research and development expand the capabilities of Large Language Models (LLMs), there is a growing focus on their applications within the healthcare sector, driven by the large volume of data generated in healthcare. There are a few medicine-oriented evaluation datasets and benchmarks for assessing the performance of various LLMs in clinical scenarios; however, there is a paucity of information on the real-world usefulness of LLMs in context-specific scenarios in resource-constrained settings. In this work, 5 iterations of a decision support tool for medical emergencies using 5 distinct generalized LLMs were constructed, alongside a combination of Prompt Engineering and Retrieval Augmented Generation techniques. 50 responses were generated from the LLMs. Quantitative and qualitative evaluations of the LLM responses were provided by 13 physicians (general practitioners) with an average of 3 years of practice experience managing medical emergencies in resource-constrained settings in Ghana. Machine evaluations of the LLM responses were also computed and compared with the expert evaluations.
Health Informatics
What problem does this paper attempt to address?
This paper examines how large language models (LLMs) perform in decision support tools for medical emergencies, particularly in resource-constrained settings. The authors built five iterations of a decision support tool, each based on a distinct general-purpose LLM and combined with prompt engineering and retrieval-augmented generation (RAG). The tools generated 50 responses, which 13 physicians (general practitioners in Ghana with an average of 3 years of experience managing medical emergencies) evaluated both quantitatively and qualitatively. The evaluation focused on how applicable LLMs are to specific clinical scenarios and on their practical value in low- and middle-income countries (LMICs). The study found that the first aid recommendations differed significantly when the LLMs were given prompts containing specific background information. Comparing machine evaluations of the responses with the expert evaluations highlighted the importance of prompt context when assessing LLM performance.
Although some LLMs score highly on medical natural language processing benchmarks, their translational value in real clinical scenarios in LMICs is not yet clear. The paper cites previous work indicating that LLMs could improve healthcare through task automation and clinical decision support, while noting that cultural and social relevance must be considered when applying them in LMICs. The study also compared different combinations of LLMs and RAG, finding that correctly applied RAG can improve model performance. The methodology involved selecting several high-performing LLMs, adjusting their parameters, generating responses with prompt engineering and RAG, and having the responses evaluated by physicians familiar with these settings and clinical situations.
The results showed that general-purpose LLMs with moderate prompt engineering received satisfactory evaluations from physicians for both diagnosis and emergency guidance. However, machine evaluations did not correlate with human evaluations, which underscores the importance of context and the need for human evaluation. In conclusion, the paper aims to fill the knowledge gap on LLM applications in clinical scenarios in LMICs, and it suggests that future research further explore LLM performance across different settings in order to develop effective, cost-efficient clinical decision support tools suited to resource-limited regions.
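As a rough illustration of how agreement between machine and clinician evaluations might be checked, the sketch below computes a Spearman rank correlation between two lists of per-response scores. The choice of Spearman's coefficient is an assumption made for illustration; the summary above does not say which statistic the authors used. The function names and the score values are hypothetical placeholders, not data from the study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholder scores for five responses (not study data):
# an automated metric on a 0-1 scale and mean clinician ratings on a 1-5 scale.
machine_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
clinician_ratings = [4.1, 3.2, 4.4, 3.9, 3.0]
print(spearman(machine_scores, clinician_ratings))
```

A coefficient near zero would be consistent with machine scores carrying little information about clinician judgments, while values near 1 or -1 would indicate strong monotonic agreement or disagreement. With only a handful of responses, any such coefficient would be a weak signal on its own.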