Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses

Emma Croxford, Yanjun Gao, Brian Patterson, Daniel To, Samuel Tesch, Dmitriy Dligach, Anoop Mayampurath, Matthew M Churpek, Majid Afshar
DOI: https://doi.org/10.1101/2024.03.20.24304620
2024-04-09
Abstract: In the evolving landscape of clinical Natural Language Generation (NLG), assessing abstractive text quality remains challenging, as existing methods often overlook generative task complexities. This work aimed to examine the current state of automated evaluation metrics in NLG in healthcare. To have a robust and well-validated baseline with which to examine the alignment of these metrics, we created a comprehensive human evaluation framework. Employing ChatGPT-3.5-turbo generative output, we correlated human judgments with each metric. None of the metrics demonstrated high alignment; however, the SapBERT score, which is based on the Unified Medical Language System (UMLS), showed the best results. This underscores the importance of incorporating domain-specific knowledge into evaluation efforts. Our work reveals the deficiency in quality evaluations for generated text and introduces our comprehensive human evaluation framework as a baseline. Future efforts should prioritize integrating medical knowledge databases to enhance the alignment of automated metrics, particularly focusing on refining the SapBERT score for improved assessments.
Health Informatics
What problem does this paper attempt to address?
This paper addresses how to evaluate the quality of text produced by Natural Language Generation (NLG) for medical diagnoses. Current evaluation methods often overlook the complexity of the generation task, which can make their quality assessments unreliable. The researchers built a comprehensive human evaluation framework, applied it to output generated by ChatGPT-3.5-turbo, and then correlated the resulting human judgments with a range of automated evaluation metrics to see how well each metric aligns with human assessment. None of the metrics aligned closely with human judgment, but the SapBERT score, which is based on the Unified Medical Language System (UMLS), performed best. This result underscores the importance of incorporating domain-specific knowledge into evaluation. The paper exposes the shortcomings of current quality evaluation for generated text and offers its human evaluation framework as a baseline. Future work should focus on integrating medical knowledge databases into automated metrics, and in particular on refining the SapBERT score, to improve alignment with human judgment. The study also points out the limitations of existing automated metrics such as ROUGE when applied to highly abstractive medical diagnosis text, and calls for more nuanced approaches to evaluating NLG in clinical decision support.
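The core comparison described above, correlating human judgments with each automated metric, can be illustrated with a small sketch. This is not the authors' code: the summary does not say which correlation statistic was used, so Spearman rank correlation is an assumption here, and the metric names and all scores below are hypothetical placeholders rather than data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one human score per generated diagnosis, and the
# score each (made-up) automated metric assigned to the same outputs.
human_scores = [1, 2, 3, 4, 5]
metric_scores = {
    "metric_a": [0.10, 0.20, 0.30, 0.40, 0.50],
    "metric_b": [0.50, 0.40, 0.30, 0.20, 0.10],
    "metric_c": [0.30, 0.10, 0.40, 0.20, 0.50],
}

# One correlation per metric; values near 1 mean the metric ranks the
# outputs much as the human evaluators did.
correlations = {
    name: spearman(human_scores, scores)
    for name, scores in metric_scores.items()
}

for name, rho in sorted(correlations.items(), key=lambda kv: -kv[1]):
    print(f"{name}: rho = {rho:.2f}")
```

A rank-based statistic is used in this sketch because human ratings are typically ordinal; a study with different rating scales might reasonably choose another measure.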