Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation

Jack Krolik, Herprit Mahal, Feroz Ahmad, Gaurav Trivedi, Bahador Saket
2024-09-03
Abstract: This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q&A) systems, a crucial form of Natural Language Processing. Traditionally, human evaluation has been indispensable for assessing the quality of these responses. However, manual evaluation by medical professionals is time-consuming and costly. Our study examines whether LLMs can reliably replicate human evaluations by using questions derived from patient data, thereby saving valuable time for medical experts. While the findings suggest promising results, further research is needed to address more specific or complex questions that were beyond the scope of this initial investigation.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the cost of evaluating medical Q&A systems. Traditionally, medical professionals assess the quality of these systems' responses by hand, which is both time-consuming and expensive. The paper explores whether large language models (LLMs) can automate this process, saving time and cost while making evaluations more consistent and reproducible.

Specifically, the research team collected a dataset of 94 evaluation sets, each comprising three parts: a question, a ground truth answer, and an answer generated by an internally developed Q&A system. Using an advanced language model, ChatGPT-4, the researchers designed an automated evaluation process that checks whether each system-generated answer meets predefined standards. The paper details six evaluation metrics: Relevance, Succinctness, Medical Correctness, Hallucination, Completeness, and Coherence. Applying these metrics in medical scenarios, the team found that LLM-based evaluation cut the time required from 6 hours to 35 minutes.

The paper also discusses future research directions: using multi-model approaches to make the evaluation system more robust, continuing to refine prompt engineering, and expanding the dataset to cover more diverse medical contexts. Finally, it emphasizes ethical considerations, stressing that LLMs should serve as an auxiliary tool rather than a replacement for human expert judgment.
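The evaluation setup described above (a question, a ground truth answer, a generated answer, and six metrics) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `EvaluationSet`, `build_prompt`, `parse_verdict`, `evaluate`, and `time_reduction` are hypothetical, the prompt wording and the PASS/FAIL reply format are assumptions, and `judge` stands in for whatever call sends a prompt to the LLM and returns its reply.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The six metrics named in the summary.
METRICS = [
    "Relevance",
    "Succinctness",
    "Medical Correctness",
    "Hallucination",
    "Completeness",
    "Coherence",
]


@dataclass
class EvaluationSet:
    """One evaluation set: a question, its ground truth, and the system's answer."""
    question: str
    ground_truth: str
    generated_answer: str


def build_prompt(item: EvaluationSet, metric: str) -> str:
    """Assemble a judging prompt for one metric (wording is an assumption)."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric}")
    return (
        f"You are evaluating a medical Q&A system on {metric}.\n"
        f"Question: {item.question}\n"
        f"Ground truth answer: {item.ground_truth}\n"
        f"Generated answer: {item.generated_answer}\n"
        "Reply with PASS or FAIL on the first line, then a brief justification."
    )


def parse_verdict(reply: str) -> bool:
    """Read the first line of the reply as PASS (True) or FAIL (False)."""
    lines = reply.strip().splitlines()
    if not lines:
        raise ValueError("empty reply")
    first_line = lines[0].strip().upper()
    if first_line.startswith("PASS"):
        return True
    if first_line.startswith("FAIL"):
        return False
    raise ValueError(f"unparseable verdict: {reply!r}")


def evaluate(item: EvaluationSet, judge: Callable[[str], str]) -> Dict[str, bool]:
    """Score one evaluation set on every metric using the supplied judge."""
    return {m: parse_verdict(judge(build_prompt(item, m))) for m in METRICS}


def time_reduction(before_minutes: float, after_minutes: float) -> float:
    """Fraction of evaluation time saved."""
    return 1 - after_minutes / before_minutes
```

Keeping `judge` as a plain callable means the same loop can be run against different models, which is one way the multi-model direction mentioned above could be explored. With the figures reported in the summary, `time_reduction(360, 35)` comes to about 0.90, that is, roughly a 90% reduction in evaluation time.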