Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for Radiology Reports

Qingqing Zhu, Xiuying Chen, Qiao Jin, Benjamin Hou, Tejas Sudharshan Mathai, Pritam Mukherjee, Xin Gao, Ronald M Summers, Zhiyong Lu
2024-02-17
Abstract: In radiology, Artificial Intelligence (AI) has significantly advanced report generation, but automatic evaluation of these AI-produced reports remains challenging. Current metrics, such as Conventional Natural Language Generation (NLG) and Clinical Efficacy (CE), often fall short in capturing the semantic intricacies of clinical contexts or overemphasize clinical details, undermining report clarity. To overcome these issues, our proposed method synergizes the expertise of professional radiologists with Large Language Models (LLMs), like GPT-3.5 and GPT-4. Utilizing In-Context Instruction Learning (ICIL) and Chain of Thought (CoT) reasoning, our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human and AI-generated reports. This is further enhanced by a regression model that aggregates sentence evaluation scores. Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a 0.48 score, outperforming the METEOR metric by 0.19, while our "Regressed GPT-4" model shows even greater alignment with expert evaluations, exceeding the best existing metric by a 0.35 margin. Moreover, the robustness of our explanations has been validated through a thorough iterative strategy. We plan to publicly release annotations from radiology experts, setting a new standard for accuracy in future assessments. This underscores the potential of our approach in enhancing the quality assessment of AI-driven medical reports.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses a gap in the automatic evaluation of AI-generated radiology reports: existing metrics, such as conventional natural language generation (NLG) and clinical efficacy (CE) metrics, either fail to capture the semantic complexity of clinical context or overemphasize clinical details at the cost of report clarity. To overcome this, the paper proposes a method that combines the expertise of professional radiologists with large language models (LLMs) to evaluate AI-generated radiology reports more reliably. Its main contributions are:

1. **Introducing a new evaluation method**: Combining radiologists' expertise with in-context instruction learning (ICIL) and chain-of-thought (CoT) reasoning, using LLMs such as GPT-3.5 and GPT-4 to evaluate radiology reports.
2. **Benchmarking and performance comparison**: Showing experimentally that the method aligns with expert evaluations better than existing metrics. The "Detailed GPT-4 (5-shot)" model scores 0.48, which is 0.19 above METEOR, and the "Regressed GPT-4" model exceeds the best existing metric by 0.35.
3. **Public release of expert annotations**: Planning to publicly release the radiologists' annotation data to set a standard for future evaluations.
4. **Interpretability and explainability**: Providing detailed explanations alongside the evaluation scores, which makes the evaluations more transparent and practical.

Through these contributions, the paper aims to improve the quality assessment of AI-generated medical reports and, in turn, the accuracy and reliability of those reports in clinical applications.
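
The abstract mentions a regression model that aggregates sentence-level evaluation scores into a report-level score, but gives no details of its form. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes each report is summarized by the mean of its sentence scores and fits a one-variable least-squares line to expert ratings. The data values, the 0 to 1 sentence scale, and the names `fit_line` and `predict_report_score` are all hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data: LLM-assigned scores for each sentence of four reports
# (assumed scale: 0 = wrong, 1 = correct) and one expert rating per report.
sentence_scores = [
    [1.0, 1.0, 1.0],
    [1.0, 0.5, 0.0],
    [0.0, 0.0],
    [0.5, 0.5, 1.0, 1.0],
]
expert_scores = [4.5, 3.0, 1.0, 4.0]

# Assumed aggregation feature: the mean sentence score of each report.
features = [mean(scores) for scores in sentence_scores]
slope, intercept = fit_line(features, expert_scores)


def predict_report_score(scores):
    """Map a list of sentence scores to a predicted report-level score."""
    return slope * mean(scores) + intercept


print(predict_report_score([1.0, 0.5, 1.0]))
```

A real aggregator would likely use richer features (for example, counts per error category) and be validated against held-out expert annotations; the single-feature fit here only illustrates how sentence scores can be mapped onto an expert scale.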