DocLens: Multi-aspect Fine-grained Evaluation for Medical Text Generation

Yiqing Xie, Sheng Zhang, Hao Cheng, Pengfei Liu, Zelalem Gero, Cliff Wong, Tristan Naumann, Hoifung Poon, Carolyn Rose
2024-10-03
Abstract: Medical text generation aims to assist with administrative work and highlight salient information to support decision-making. To reflect the specific requirements of medical text, in this paper, we propose a set of metrics to evaluate the completeness, conciseness, and attribution of the generated text at a fine-grained level. The metrics can be computed by various types of evaluators including instruction-following (both proprietary and open-source) and supervised entailment models. We demonstrate the effectiveness of the resulting framework, DocLens, with three evaluators on three tasks: clinical note generation, radiology report summarization, and patient question summarization. A comprehensive human study shows that DocLens exhibits substantially higher agreement with the judgments of medical experts than existing metrics. The results also highlight the need to improve open-source evaluators and suggest potential directions.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the evaluation of medical text generation. It proposes a framework, DocLens, that assesses generated medical text at a fine-grained level along three aspects: completeness, conciseness, and attribution. Existing automatic metrics typically assign a single coarse score to the whole system output without indicating which aspects or criteria the score reflects. Human evaluation can capture finer-grained information, but it is costly and does not scale. The paper therefore designs a set of automated metrics to overcome these shortcomings and better fit the needs of the medical domain. The framework is demonstrated on three tasks (clinical note generation, radiology report summarization, and patient question summarization), and a human study shows that DocLens agrees substantially more closely with the judgments of medical experts than existing metrics do.
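To make the idea of fine-grained scoring concrete, here is a minimal Python sketch of one plausible reading. The summary above does not give the metric definitions, so everything in it is an assumption for illustration: completeness is taken as the fraction of reference claims supported by the generated text, and conciseness as the fraction of generated claims supported by the reference. The names `Entails`, `supported_fraction`, `completeness`, `conciseness`, and `toy_entails` are hypothetical, and the substring matcher is only a stand-in for a real evaluator such as an instruction-following or entailment model.

```python
from typing import Callable, Sequence

# Hypothetical evaluator interface (an assumption, not the paper's API):
# returns True if `text` supports `claim`.
Entails = Callable[[str, str], bool]


def supported_fraction(claims: Sequence[str], text: str, entails: Entails) -> float:
    """Fraction of `claims` that `text` supports; 0.0 when there are no claims."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if entails(text, claim))
    return supported / len(claims)


def completeness(reference_claims: Sequence[str], generated_text: str, entails: Entails) -> float:
    """Assumed definition: share of reference claims covered by the generated text."""
    return supported_fraction(reference_claims, generated_text, entails)


def conciseness(generated_claims: Sequence[str], reference_text: str, entails: Entails) -> float:
    """Assumed definition: share of generated claims supported by the reference."""
    return supported_fraction(generated_claims, reference_text, entails)


def toy_entails(text: str, claim: str) -> bool:
    """Toy stand-in for a real evaluator: case-insensitive substring match."""
    return claim.lower() in text.lower()


if __name__ == "__main__":
    reference_claims = ["chest pain", "history of hypertension"]
    generated_text = "Patient reports chest pain for two days."
    # One of the two reference claims appears in the generated text.
    print(completeness(reference_claims, generated_text, toy_entails))
```

Scoring per claim rather than per document is what lets the result point to specific missing or unsupported statements. A practical evaluator would replace `toy_entails` with a model call and would need a separate step to extract claims from each text.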