Abstract: Medical Report Grounding is pivotal in identifying the most relevant regions in medical images based on a given phrase query, a critical aspect of medical image analysis and radiological diagnosis. However, prevailing visual grounding approaches require key phrases to be extracted manually from medical reports, which burdens both system efficiency and physicians. In this paper, we introduce a novel framework, Medical Report Grounding (MedRG), an end-to-end solution that uses a multi-modal Large Language Model to predict key phrases by incorporating a unique token, <BOX>, into the vocabulary to serve as an embedding for unlocking detection capabilities. Subsequently, the vision encoder-decoder jointly decodes the hidden embedding and the input medical image, generating the corresponding grounding box. The experimental results validate the effectiveness of MedRG, which surpasses the performance of existing state-of-the-art medical phrase grounding methods. This study represents a pioneering exploration of the medical report grounding task, marking the first-ever endeavor in this domain.

What problem does this paper attempt to address?

The problem this paper addresses is how to automatically extract key phrases from medical reports and locate the image regions those phrases refer to, in the context of medical image analysis and radiological diagnosis. Specifically, the paper proposes a new framework, Medical Report Grounding (MedRG), which uses a multi-modal Large Language Model (LLM) to predict the key phrases in a medical report together with their corresponding grounding boxes. This task is of great significance for improving the interpretability of medical image analysis and the accuracy of radiological diagnosis.

### Key Issues

1. **Burden of Manual Key Phrase Extraction**: Existing visual grounding methods require key phrases to be extracted manually from medical reports, which increases system complexity and physician workload and may also introduce human error.
2. **Automation and Efficiency**: How to make medical report grounding more efficient through automation, reducing physician workload while maintaining high precision.
3. **Multi-modal Data Processing**: How to effectively combine text and image data and use the capabilities of large language models to achieve end-to-end medical report grounding.

### Solutions

The paper proposes a new framework, MedRG, whose main features include:

- **Multi-modal Large Language Model**: A multi-modal LLM is used to understand medical reports and generate their key phrases.
- **`<BOX>` Token**: A new token, `<BOX>`, is added to the vocabulary; its embedding unlocks the detection capability and is used to predict the grounding box of each key phrase.
- **End-to-End Training**: The entire framework can be trained end-to-end, improving the robustness and accuracy of the model.
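The `<BOX>` token idea can be illustrated with a toy sketch. This is not the paper's implementation: the vocabulary, the token ids, the two-dimensional hidden states, and the helper names `add_special_token` and `box_embeddings` are all hypothetical. It only shows the bookkeeping involved: register the new token, then pick out the hidden state at each position where the model emitted it, which is the embedding a box decoder would consume together with the image features.

```python
from typing import Dict, List, Sequence

BOX_TOKEN = "<BOX>"


def add_special_token(vocab: Dict[str, int], token: str) -> int:
    """Add `token` to the vocabulary if it is missing and return its id."""
    if token not in vocab:
        vocab[token] = len(vocab)
    return vocab[token]


def box_embeddings(
    token_ids: Sequence[int],
    hidden_states: Sequence[Sequence[float]],
    box_id: int,
) -> List[List[float]]:
    """Return the hidden state at every position holding the <BOX> token."""
    if len(token_ids) != len(hidden_states):
        raise ValueError("token_ids and hidden_states must have equal length")
    return [list(h) for t, h in zip(token_ids, hidden_states) if t == box_id]


# Toy vocabulary and a toy generated sequence (all values are made up).
vocab = {"<pad>": 0, "small": 1, "left": 2, "pleural": 3, "effusion": 4}
box_id = add_special_token(vocab, BOX_TOKEN)

# Hypothetical output: a key phrase followed by the <BOX> token.
token_ids = [1, 2, 3, 4, box_id]
# Stand-in for the LLM's last-layer hidden states, one vector per token.
hidden_states = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.7, 0.2], [0.9, 0.4]]

selected = box_embeddings(token_ids, hidden_states, box_id)
print(box_id, selected)
```

In a real system the selected vector would be projected and passed to a vision decoder that outputs box coordinates; that part is omitted here because the summary does not specify it.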
### Experimental Results

- **Quantitative Evaluation**: On the MRG-MS-CXR dataset, MedRG significantly outperforms existing methods on multiple evaluation metrics, reaching an accuracy close to 80% on AP30 (mIoU > 0.3) and over 90% on AP10 (mIoU > 0.1).
- **Qualitative Analysis**: In visual comparisons across a variety of typical cases, MedRG accurately extracts key phrases and locates the corresponding image regions.

### Conclusion

The paper proposes an innovative multi-modal medical report grounding framework, MedRG. By using large language models and multi-modal data processing techniques, it achieves efficient and accurate medical report grounding. This research both advances multi-modal medical image analysis and provides new tools and methods for radiological diagnosis.
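The IoU-thresholded metrics mentioned in the experimental results can be illustrated with a short sketch. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates and that a metric such as AP30 counts the fraction of predicted boxes whose IoU with the ground-truth box exceeds 0.3; the paper's exact definition may differ, and the function names and sample boxes below are made up for illustration.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float
) -> float:
    """Fraction of predictions whose IoU with the paired target exceeds threshold."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal length")
    hits = sum(1 for p, t in zip(preds, targets) if iou(p, t) > threshold)
    return hits / len(preds)


# Made-up boxes: a perfect match, a half-shifted box, and a mostly-shifted box.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (5, 0, 15, 10), (8, 0, 18, 10)]

print(accuracy_at_iou(preds, targets, 0.3))  # stricter threshold
print(accuracy_at_iou(preds, targets, 0.1))  # looser threshold
```

A looser threshold can only keep or raise the score, which is consistent with the AP10 figure being higher than the AP30 figure in the reported results.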

MedRG: Medical Report Grounding with Multi-modal Large Language Model

Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment

Large Language Model with Region-guided Referring and Grounding for CT Report Generation

TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Model

Visual Grounding of Whole Radiology Reports for 3D CT Images

A Refer-and-Ground Multimodal Large Language Model for Biomedicine

Joint Embedding of Deep Visual and Semantic Features for Medical Image Report Generation

Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

MAIRA-2: Grounded Radiology Report Generation

MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation

AutoRG-Brain: Grounded Report Generation for Brain MRI

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Multifocal region-assisted cross-modality learning for chest X-ray report generation

Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models

Harnessing the Power of Pre-trained Vision-Language Models for Efficient Medical Report Generation

Customizing General-Purpose Foundation Models for Medical Report Generation

Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

Visual Grounding With Joint Multimodal Representation and Interaction

Grounded Knowledge-Enhanced Medical VLP for Chest X-Ray

R2GenCSR: Retrieving Context Samples for Large Language Model based X-ray Medical Report Generation

Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation