Abstract: Interpreting medical images such as chest X-rays and retinal images is an essential step in diagnosing and treating the relevant diseases. Automatic and reliable medical report generation systems can reduce this time-consuming workload, improve the efficiency of clinical workflows, and decrease variation in practice between clinical professionals. Many recent approaches based on an image-encoder and language-decoder structure have been proposed to tackle this task. However, some technical challenges remain, including how effectively language and visual cues are fused and the difficulty of obtaining an effective pre-trained image feature extractor for medical-specific tasks. In this work, we propose a weighted query-key interacting attention module that includes both second-order and first-order interactions. Compared with conventional scaled dot-product attention, this design yields a stronger fusion mechanism between language and visual signals. In addition, we propose a contrastive pre-training step to reduce the domain gap between the image encoder and the target dataset. To test the generalizability of our learning scheme, we collected two datasets and evaluated our model on them: the world's first multi-modality retina report generation dataset, referred to as Retina ImBank, and a large-scale Chinese-language retina report dataset, referred to as Retina Chinese. These two datasets will be made publicly available and serve as benchmarks to encourage further research in this field. Our experimental results show that the proposed method outperforms multiple state-of-the-art image captioning and medical report generation methods on the IU X-RAY, MIMIC-CXR, Retina ImBank, and Retina Chinese datasets.
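For readers unfamiliar with the two building blocks the abstract refers to, the following is a minimal sketch of the conventional scaled dot-product attention that the proposed module is compared against, and of a generic symmetric contrastive (InfoNCE-style) loss. It is not the authors' implementation: the abstract does not specify the weighted query-key interaction or the exact contrastive objective, so the function names, the `temperature` parameter, and the choice of a symmetric loss over two paired views are assumptions made purely for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Baseline attention: softmax(Q K^T / sqrt(d)) V on lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Second-order (query-key dot product) scores, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


def contrastive_loss(view_a, view_b, temperature=0.1):
    """Symmetric InfoNCE-style loss; row i of view_a is paired with row i of view_b.

    Assumption: the paper's pre-training objective may differ from this form.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    n = len(a)
    # Cosine similarities divided by the temperature.
    sims = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b] for ai in a]
    loss_ab = -sum(math.log(softmax(sims[i])[i]) for i in range(n)) / n
    cols = [[sims[r][c] for r in range(n)] for c in range(n)]
    loss_ba = -sum(math.log(softmax(cols[i])[i]) for i in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)
```

In this sketch, correctly paired embeddings give a lower loss than mismatched ones, which is the property a contrastive pre-training step relies on when adapting an encoder to a target dataset.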

Enhancing Biomedical Multi-modal Representation Learning with Multi-scale Pre-training and Perturbed Report Discrimination

Improving Medical Vision-Language Contrastive Pretraining with Semantics-aware Triage

MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report

Contrastive pre-training and linear interaction attention-based transformer for universal medical reports generation

A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports

Multimodal Pretraining from Monolingual to Multilingual

Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

MITER: Medical Image–TExt joint adaptive pretRaining with multi-level contrastive learning

Multimodal Contrastive Training for Visual Representation Learning

Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis

Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding

Cross-modal Contrastive Attention Model for Medical Report Generation

Medical Vision-Language Pre-Training for Brain Abnormalities

Contrastive Learning of Medical Visual Representations from Paired Images and Text

MGI: Multimodal Contrastive pre-training of Genomic and Medical Imaging

Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity

PRIOR: Prototype Representation Joint Learning from Medical Images and Reports

PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents