FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models

Alice Heiman, Xiaoman Zhang, Emma Chen, Sung Eun Kim, Pranav Rajpurkar
2024-11-28
Abstract: Medical vision-language models often struggle with generating accurate quantitative measurements in radiology reports, leading to hallucinations that undermine clinical reliability. We introduce FactCheXcker, a modular framework that de-hallucinates radiology report measurements by leveraging an improved query-code-update paradigm. Specifically, FactCheXcker employs specialized modules and the code generation capabilities of large language models to solve measurement queries generated based on the original report. After extracting measurable findings, the results are incorporated into an updated report. We evaluate FactCheXcker on endotracheal tube placement, which accounts for an average of 78% of report measurements, using the MIMIC-CXR dataset and 11 medical report-generation models. Our results show that FactCheXcker significantly reduces hallucinations, improves measurement precision, and maintains the quality of the original reports. Specifically, FactCheXcker improves the performance of all 11 models and achieves an average improvement of 94.0% in reducing measurement hallucinations measured by mean absolute error.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty that medical report-generation models have in producing accurate quantitative measurements in chest X-ray (CXR) reports. Existing medical vision-language models often "hallucinate" measurements when generating radiology reports: the generated values do not match the actual image, which seriously undermines clinical reliability.

### Main problems:

1. **Measurement hallucinations**: Medical report-generation models are prone to inaccurate numerical predictions on tasks that require precise measurements, such as determining the size of lung nodules or measuring the distance from the endotracheal tube (ETT) to the carina.
2. **Clinical reliability**: Incorrect or missing measurement values may lead to adverse clinical outcomes because many reporting guidelines rely on precise thresholds. For example, an incorrectly positioned endotracheal tube may lead to serious complications such as hypoxia, pneumothorax, and even death.
3. **Limitations of existing models**: Current medical report-generation models cannot reliably interpret fine-grained quantitative information and spatial relationships, and they perform especially poorly on key measurement tasks in medical images.

### Solutions:

To address these challenges, the authors propose the **FactCheXcker** framework, a modular pipeline that re-evaluates and updates measurement values in model-generated radiology reports without retraining or modifying the original model. Its core components are:

- **Query Generator**: Generates measurement queries from the original report and identifies potential measurement discrepancies.
- **Code Generator**: Generates executable code from the queries to obtain accurate measurements from the image.
- **Report Updater**: Integrates the verified measurements into the report, updating or deleting inaccurate content.

With this approach, FactCheXcker reduces measurement hallucinations and improves measurement accuracy while maintaining the quality of the original report. Experiments on 11 report-generation models show an average 94.0% improvement in measurement hallucinations, measured by mean absolute error, along with more accurate endotracheal tube position measurements.

### Conclusion:

FactCheXcker offers an effective solution to measurement hallucinations in medical image report generation, making these models more reliable and practical in clinical applications.
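The three stages described above (query, code, update) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The regular expression, the `Query` class, the function names, and the `measure` callback are all hypothetical stand-ins. In the real framework an LLM generates the queries and the measurement code; here a fixed pattern and a caller-supplied function play those roles.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical pattern for statements like "3.5 cm above the carina".
MEASUREMENT = re.compile(r"(\d+(?:\.\d+)?)\s*cm\s+above\s+the\s+carina", re.IGNORECASE)


@dataclass
class Query:
    finding: str            # assumed label, e.g. "ett_carina_distance"
    reported_cm: float      # value stated in the generated report
    span: Tuple[int, int]   # character span of the number in the report


def generate_queries(report: str) -> List[Query]:
    """Query Generator stand-in: locate measurement statements in the report."""
    return [
        Query("ett_carina_distance", float(m.group(1)), m.span(1))
        for m in MEASUREMENT.finditer(report)
    ]


def solve_query(query: Query, measure: Callable[[str], Optional[float]]) -> Optional[float]:
    """Code Generator stand-in: delegate to a caller-supplied measurement function."""
    return measure(query.finding)


def update_report(report: str, queries: List[Query], measured: List[Optional[float]]) -> str:
    """Report Updater stand-in: overwrite reported numbers with measured ones."""
    pairs = sorted(zip(queries, measured), key=lambda p: p[0].span[0], reverse=True)
    # Replace from right to left so earlier character spans remain valid.
    for query, value in pairs:
        if value is None:
            continue  # no measurement available; this sketch leaves the text as-is
        start, end = query.span
        report = report[:start] + f"{value:.1f}" + report[end:]
    return report


def factcheck(report: str, measure: Callable[[str], Optional[float]]) -> str:
    """Run the query, code, and update stages in order."""
    queries = generate_queries(report)
    measured = [solve_query(q, measure) for q in queries]
    return update_report(report, queries, measured)
```

Calling `factcheck(report, measure)` with a `measure` function backed by an image-based detector would return the report with each matched distance replaced by the measured value. The report generator itself is never touched, which mirrors the framework's stated property of working without retraining or modifying the original model.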