MedRG: Medical Report Grounding with Multi-modal Large Language Model

Ke Zou, Yang Bai, Zhihao Chen, Yang Zhou, Yidi Chen, Kai Ren, Meng Wang, Xuedong Yuan, Xiaojing Shen, Huazhu Fu
2024-04-10
Abstract: Medical Report Grounding is pivotal in identifying the most relevant regions in medical images based on a given phrase query, a critical aspect in medical image analysis and radiological diagnosis. However, prevailing visual grounding approaches necessitate the manual extraction of key phrases from medical reports, imposing substantial burdens on both system efficiency and physicians. In this paper, we introduce a novel framework, Medical Report Grounding (MedRG), an end-to-end solution that uses a multi-modal Large Language Model to predict key phrases by incorporating a unique token, <BOX>, into the vocabulary to serve as an embedding for unlocking detection capabilities. Subsequently, the vision encoder-decoder jointly decodes the hidden embedding and the input medical image, generating the corresponding grounding box. The experimental results validate the effectiveness of MedRG, surpassing the performance of the existing state-of-the-art medical phrase grounding methods. This study represents a pioneering exploration of the medical report grounding task, marking the first-ever endeavor in this domain.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to automatically extract key phrases from a medical report and locate the image regions those phrases refer to, a task relevant to medical image analysis and radiological diagnosis. Specifically, it proposes a new framework, Medical Report Grounding (MedRG), which uses a multi-modal Large Language Model (LLM) to predict the key phrases in a medical report together with their corresponding grounding boxes. The task matters for improving the interpretability of medical image analysis and the accuracy of radiological diagnosis.

### Key Issues

1. **Burden of Manual Key Phrase Extraction**: Existing visual grounding methods require key phrases to be extracted manually from medical reports, which increases system complexity and physician workload and may introduce human error.
2. **Automation and Efficiency**: How to make medical report grounding more efficient through automation, reducing physician workload while maintaining high precision.
3. **Multi-modal Data Processing**: How to combine text and image data effectively and use the capabilities of large language models to achieve end-to-end medical report grounding.

### Solutions

The paper proposes a new framework, MedRG, whose main features include:

- **Multi-modal Large Language Model**: A multi-modal LLM reads the medical report and generates its key phrases.
- **<BOX> Token**: A new token, <BOX>, is added to the vocabulary. Its hidden embedding unlocks detection capability: a vision encoder-decoder decodes this embedding jointly with the input image to predict the grounding box for each key phrase.
- **End-to-End Training**: The entire framework can be trained end-to-end, improving the robustness and accuracy of the model.
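The role of the <BOX> token can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the toy vocabulary, the hand-written hidden states, and the helper names `add_special_token` and `box_embeddings` are all hypothetical, and the vision encoder-decoder that would turn the selected embedding into a box is left out. The sketch only shows the bookkeeping: register the token, then pick out the hidden state at each position where the model emitted it.

```python
from typing import Dict, List, Sequence

BOX_TOKEN = "<BOX>"  # special token named in the paper summary


def add_special_token(vocab: Dict[str, int], token: str) -> int:
    """Add `token` to the vocabulary if it is missing and return its id."""
    if token not in vocab:
        vocab[token] = len(vocab)
    return vocab[token]


def box_embeddings(
    token_ids: Sequence[int],
    hidden_states: Sequence[Sequence[float]],
    box_id: int,
) -> List[List[float]]:
    """Return the hidden state at every position where the token is <BOX>."""
    if len(token_ids) != len(hidden_states):
        raise ValueError("token_ids and hidden_states must have equal length")
    return [list(h) for t, h in zip(token_ids, hidden_states) if t == box_id]


# Toy vocabulary and a toy generated sequence (both hypothetical).
vocab = {"<pad>": 0, "small": 1, "left": 2, "pleural": 3, "effusion": 4}
box_id = add_special_token(vocab, BOX_TOKEN)

# "small left pleural effusion <BOX>" as token ids.
token_ids = [1, 2, 3, 4, box_id]

# One made-up 2-dimensional hidden state per token; a real model
# would produce these in its last layer.
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

# The selected embedding(s) would be passed, with image features,
# to a box decoder (not shown here).
selected = box_embeddings(token_ids, hidden_states, box_id)
print(selected)
```

In a full system, each selected embedding would serve as the query that the box decoder combines with image features. With several key phrases in one report, the same selection yields one embedding per emitted <BOX>.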
### Experimental Results

- **Quantitative Evaluation**: On the MRG-MS-CXR dataset, MedRG significantly outperforms existing methods on multiple evaluation metrics, reaching close to 80% on AP30 (IoU > 0.3) and over 90% on AP10 (IoU > 0.1).
- **Qualitative Analysis**: Visual comparisons on a range of typical cases show that MedRG extracts key phrases accurately and locates the corresponding image regions.

### Conclusion

The paper proposes MedRG, a multi-modal framework for medical report grounding. By combining a large language model with multi-modal data processing, it achieves efficient and accurate grounding of report phrases in medical images. The work advances multi-modal medical image analysis and provides a new tool for radiological diagnosis.
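The AP30 and AP10 figures reported in the experimental results are threshold-based grounding accuracies. The sketch below assumes that APk is the fraction of predicted boxes whose intersection-over-union (IoU) with the ground-truth box exceeds k/100. The summary does not spell out the paper's exact definition, so treat this as an illustration. Boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates, and `iou` and `accuracy_at` are hypothetical helper names.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of prediction/ground-truth pairs with IoU above `threshold`."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(iou(p, g) > threshold for p, g in zip(preds, gts))
    return hits / len(preds)


# Three toy pairs: a perfect match, a partial overlap, and a miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10), (20, 20, 30, 30)]

ap10 = accuracy_at(preds, gts, 0.1)
ap30 = accuracy_at(preds, gts, 0.3)
ap50 = accuracy_at(preds, gts, 0.5)
print(ap10, ap30, ap50)
```

Lower thresholds are more forgiving, which matches the reported AP10 being higher than AP30. In the toy data, the partially overlapping pair counts as a hit at 0.1 and 0.3 but not at 0.5.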