Learning Visual-Semantic Embeddings for Reporting Abnormal Findings on Chest X-rays

Jianmo Ni, Chun-Nan Hsu, Amilcare Gentili, Julian McAuley
DOI: https://doi.org/10.48550/arXiv.2010.02467
2020-10-06
Abstract: Automatic medical image report generation has drawn growing attention due to its potential to alleviate radiologists' workload. Existing work on report generation often trains encoder-decoder networks to generate complete reports. However, such models are affected by data bias (e.g., label imbalance) and face common issues inherent in text generation models (e.g., repetition). In this work, we focus on reporting abnormal findings on radiology images; instead of training on complete radiology reports, we propose a method to identify abnormal findings from the reports in addition to grouping them with unsupervised clustering and minimal rules. We formulate the task as cross-modal retrieval and propose Conditional Visual-Semantic Embeddings to align images and fine-grained abnormal findings in a joint embedding space. We demonstrate that our method is able to retrieve abnormal findings and outperforms existing generation models on both clinical correctness and text generation metrics.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper targets data bias in automatic medical image report generation, specifically when reporting abnormal findings on chest X-rays (CXR). Existing report-generation models usually train encoder-decoder networks to produce complete reports, but these models suffer from data bias (such as label imbalance) and from problems common to text-generation models (such as repetition). The paper therefore focuses on reporting abnormal findings in radiology images instead of training on complete radiology reports. It formulates the task as cross-modal retrieval and proposes Conditional Visual-Semantic Embeddings (CVSE) to align images with fine-grained abnormal findings in a joint embedding space.

### Main Contributions

1. **Conditional Visual-Semantic Embeddings**: Learns conditional visual-semantic embeddings of radiology images and reports, and measures the similarity between image regions and abnormal findings by optimizing a triplet ranking loss.
2. **Automatic Identification and Grouping of Abnormal Findings**: Develops an automatic method to identify and group abnormal findings from a large collection of radiology reports.
3. **Experimental Verification**: Shows through comprehensive experiments that the retrieval-based method outperforms existing generation models on both clinical correctness and natural-language-generation metrics.

### Method Overview

- **Problem Definition**: Each report \(R_a\) contains \(M\) abnormal findings (i.e., sentences). The semantic embedding of an abnormal finding \(a\) is \(v\), and the feature map of the radiology image \(I\) is \(E\). Both are transformed into the joint embedding space \(\mathbb{R}^d\) through a linear projection layer.
- **Similarity Measurement**: CVSE learns the fine-grained matching between image regions and the target abnormal finding, and computes a similarity score \(d(a, I)\) between the image and the finding.
- **Loss Function**: The visual-semantic embeddings are learned by optimizing a hinge-based triplet ranking loss \(L\).

### Experimental Results

- **Baseline Models**: CVSE is compared with multiple generation models (such as Hier-CNN-RNN).
- **Performance Comparison**: CVSE significantly outperforms all baseline models on clinical accuracy metrics, especially precision and recall.
- **Qualitative Analysis**: Visualizing the attention weights shows that the model attends to relevant regions and can thereby localize abnormal findings.

### Conclusion

The paper proposes a retrieval-based method that reports abnormal findings in radiology images through Conditional Visual-Semantic Embeddings, mitigating two weaknesses of generation models: repetitive sentences and a bias toward normal findings. Future work will extend the method to other medical image datasets and explore transfer learning.

### Formulas

- **Similarity Measurement**, where \(m_j\) is the projected image feature at region \(j\) of the \(w \times h\) feature map and \(\alpha_j\) is its attention weight:

\[ d(a, I) = -\sum_{1 \leq j \leq w \times h} \alpha_j \|m_j - v\|^2 \]

\[ \hat{\alpha}_j = v_\alpha^\top \left( W_\alpha [m_j; v] + b_\alpha \right) \]

\[ \alpha = \text{softmax}(\hat{\alpha}) \]

- **Final Similarity Score**:

\[ d^*(a, I) = \frac{1}{2} \left( d(a, I_f) + d(a, I_l) \right) \]

- **Loss Function**, where \([x]^+ = \max(0, x)\) and \(\delta\) is the margin:

\[ L = \sum_I \left[ d^*(a^-, I) - d^*(a^+, I) + \delta \right]^+ + \sum_a \left[ d^*(a, I^-) - d^*(a, I^+) + \delta \right]^+ \]

Through these methods and experiments, the paper shows a significant improvement in reporting abnormal findings on radiology images.
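
The similarity score and ranking loss given under Formulas can be sketched in NumPy as follows. This is a minimal illustration of the equations as written in this summary, not the authors' implementation. The function names, the parameter shapes, the margin value `0.2`, and the reading of \(I_f\) and \(I_l\) as two views of the same study are all assumptions made for the example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def similarity(m, v, W_alpha, b_alpha, v_alpha):
    """d(a, I) = -sum_j alpha_j * ||m_j - v||^2 for a single image view.

    m:        (n_regions, d) projected region features, n_regions = w * h
    v:        (d,) embedding of one abnormal finding
    W_alpha:  (k, 2d), b_alpha: (k,), v_alpha: (k,)  attention parameters
              (hypothetical shapes chosen for this sketch)
    Returns the similarity score and the attention weights.
    """
    # Concatenate each region feature with the finding embedding: [m_j; v]
    mv = np.concatenate([m, np.broadcast_to(v, m.shape)], axis=1)
    # Unnormalized attention: v_alpha^T (W_alpha [m_j; v] + b_alpha)
    scores = (mv @ W_alpha.T + b_alpha) @ v_alpha
    alpha = softmax(scores)
    sq_dist = np.sum((m - v) ** 2, axis=1)
    return -float(alpha @ sq_dist), alpha


def final_similarity(m_f, m_l, v, params):
    """d*(a, I): average of the scores for the two views I_f and I_l."""
    d_f, _ = similarity(m_f, v, *params)
    d_l, _ = similarity(m_l, v, *params)
    return 0.5 * (d_f + d_l)


def triplet_loss(d_pos, d_neg_finding, d_neg_image, delta=0.2):
    """Hinge-based triplet ranking loss for one matched (finding, image) pair.

    d_pos:          d*(a+, I+), score of the matched pair
    d_neg_finding:  d*(a-, I), score of a non-matching finding for the image
    d_neg_image:    d*(a, I-), score of a non-matching image for the finding
    delta:          margin (0.2 is an assumed placeholder value)
    """
    return (max(0.0, d_neg_finding - d_pos + delta)
            + max(0.0, d_neg_image - d_pos + delta))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, n = 4, 3, 6  # embedding size, attention size, number of regions
    params = (rng.normal(size=(k, 2 * d)), rng.normal(size=k), rng.normal(size=k))
    v = rng.normal(size=d)
    m_f, m_l = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    print("d*(a, I):", final_similarity(m_f, m_l, v, params))
```

Because the score is a negated weighted squared distance, it is never positive, and it reaches zero only when every region that receives attention weight coincides with the finding embedding. The loss is zero once the matched pair beats both negatives by at least the margin.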