Abstract: Automatic medical image report generation has drawn growing attention due to its potential to alleviate radiologists' workload. Existing work on report generation often trains encoder-decoder networks to generate complete reports. However, such models are affected by data bias (e.g., label imbalance) and face common issues inherent in text generation models (e.g., repetition). In this work, we focus on reporting abnormal findings on radiology images; instead of training on complete radiology reports, we propose a method to identify abnormal findings from the reports in addition to grouping them with unsupervised clustering and minimal rules. We formulate the task as cross-modal retrieval and propose Conditional Visual-Semantic Embeddings to align images and fine-grained abnormal findings in a joint embedding space. We demonstrate that our method is able to retrieve abnormal findings and outperforms existing generation models on both clinical correctness and text generation metrics.

What problem does this paper attempt to address?

The main problem this paper addresses is data bias in automatic medical image report generation, particularly when reporting abnormal findings on chest X-rays (CXR). Existing report-generation models usually train encoder-decoder networks to generate complete reports, but these models are affected by data bias (such as label imbalance) and by common problems of text-generation models (such as repetition). The paper therefore proposes a method that focuses on reporting abnormal findings in radiology images rather than training on complete radiology reports. Specifically, it proposes Conditional Visual-Semantic Embeddings (CVSE) to align images with fine-grained abnormal findings, and formulates the task as cross-modal retrieval.

### Main Contributions

1. **Conditional Visual-Semantic Embeddings**: The model learns conditional visual-semantic embeddings of radiology images and reports. By optimizing a triplet ranking loss, it can measure the similarity between image regions and abnormal findings.
2. **Automatic Identification and Grouping of Abnormal Findings**: The paper develops an automatic method to identify and group abnormal findings from a large number of radiology reports.
3. **Experimental Verification**: Comprehensive experiments show that the retrieval-based method outperforms existing generation models on clinical correctness and natural-language-generation metrics.

### Method Overview

- **Problem Definition**: Each report \(R_a\) contains \(M\) abnormal findings (i.e., sentences). Each abnormal finding \(a\) has a semantic embedding \(v\), and the radiology image \(I\) has a feature map \(E\). Both are mapped into the joint embedding space \(\mathbb{R}^d\) through a linear projection layer.
- **Similarity Measurement**: CVSE learns the fine-grained matching between image regions and the target abnormal finding, and computes a similarity score \(d(a, I)\) between the image and the finding.
- **Loss Function**: The visual-semantic embeddings are learned by optimizing a hinge-based triplet ranking loss \(L\).

### Experimental Results

- **Baseline Models**: CVSE is compared with multiple generation models (such as Hier-CNN-RNN).
- **Performance Comparison**: CVSE significantly outperforms all baseline models on clinical accuracy metrics, especially precision and recall.
- **Qualitative Analysis**: Visualizing the attention weights shows that the model can detect relevant regions and thereby localize abnormal findings.

### Conclusion

The paper proposes a retrieval-based method that reports abnormal findings in radiology images through Conditional Visual-Semantic Embeddings. It alleviates two weaknesses of generation models: producing repetitive sentences and being biased towards normal findings. Future work will extend the method to other medical image datasets and explore transfer learning.

### Formulas

- **Similarity Measurement**:

\[ d(a, I) = -\sum_{1 \leq j \leq w \times h} \alpha_j \|m_j - v\|^2 \]

\[ \hat{\alpha}_j = v_\alpha^\top (W_\alpha [m_j; v] + b_\alpha) \]

\[ \alpha = \text{softmax}(\hat{\alpha}) \]

- **Final Similarity Score**:

\[ d^*(a, I) = \frac{1}{2}\left(d(a, I_f) + d(a, I_l)\right) \]

- **Loss Function**:

\[ L = \sum_I \left[d^*(a^-, I) - d^*(a^+, I) + \delta\right]^+ + \sum_a \left[d^*(a, I^-) - d^*(a, I^+) + \delta\right]^+ \]

Through these methods and experiments, the paper shows significant improvement in generating reports of abnormal findings in radiology images.
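The similarity and loss formulas above can be traced with a small plain-Python sketch. This is not the authors' implementation. It assumes that \(m_j\) is the \(j\)-th of the \(w \times h\) projected region features, that \(I_f\) and \(I_l\) are two views of the same study (the summary does not define them), and that \([\cdot]^+\) is \(\max(0, \cdot)\). The function names, the attention hidden size, and the margin value `delta=0.2` are hypothetical choices made only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(x, y):
    """Squared Euclidean distance ||x - y||^2."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def attention_weights(regions, v, W_alpha, b_alpha, v_alpha):
    """alpha = softmax(alpha_hat), alpha_hat_j = v_alpha^T (W_alpha [m_j; v] + b_alpha)."""
    raw = []
    for m_j in regions:
        concat = list(m_j) + list(v)  # the concatenation [m_j; v]
        hidden = [
            sum(w * c for w, c in zip(row, concat)) + b
            for row, b in zip(W_alpha, b_alpha)
        ]
        raw.append(sum(va * h for va, h in zip(v_alpha, hidden)))
    return softmax(raw)


def similarity(regions, v, params):
    """d(a, I) = -sum_j alpha_j * ||m_j - v||^2 for one image."""
    alpha = attention_weights(regions, v, *params)
    return -sum(a * sq_dist(m_j, v) for a, m_j in zip(alpha, regions))


def two_view_similarity(view_f, view_l, v, params):
    """d*(a, I) = (d(a, I_f) + d(a, I_l)) / 2."""
    return 0.5 * (similarity(view_f, v, params) + similarity(view_l, v, params))


def triplet_loss(d_pos, d_neg_finding, d_neg_image, delta=0.2):
    """Loss contribution of one matched (finding, image) pair.

    d_pos         = d*(a+, I+)  matched pair
    d_neg_finding = d*(a-, I+)  wrong finding, same image
    d_neg_image   = d*(a+, I-)  same finding, wrong image
    delta is the margin; 0.2 is an illustrative value, not from the paper.
    """
    return (max(0.0, d_neg_finding - d_pos + delta)
            + max(0.0, d_neg_image - d_pos + delta))


# Toy usage: two regions in a 2-d joint space, attention hidden size 1.
# Zero attention weights give uniform attention over the regions.
toy_params = ([[0.0, 0.0, 0.0, 0.0]], [0.0], [1.0])  # (W_alpha, b_alpha, v_alpha)
toy_regions = [[1.0, 0.0], [0.0, 1.0]]
toy_finding = [1.0, 0.0]
toy_score = two_view_similarity(toy_regions, toy_regions, toy_finding, toy_params)
```

Because the score is a negated, attention-weighted distance, a higher (less negative) value means a closer match, which is why the loss pushes the matched score above each mismatched score by at least the margin.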

Learning Visual-Semantic Embeddings for Reporting Abnormal Findings on Chest X-rays
