DeViDe: Faceted medical knowledge for improved medical vision-language pre-training

Haozhe Luo, Ziyu Zhou, Corentin Royer, Anjany Sekuboyina, Bjoern Menze

2024-04-05

Abstract: Vision-language pre-training for chest X-rays has made significant strides, primarily by utilizing paired radiographs and radiology reports. However, existing approaches often face challenges in encoding medical knowledge effectively. While radiology reports provide insights into the current disease manifestation, medical definitions (as used by contemporary methods) tend to be overly abstract, creating a gap in knowledge. To address this, we propose DeViDe, a novel transformer-based method that leverages radiographic descriptions from the open web. These descriptions outline general visual characteristics of diseases in radiographs, and when combined with abstract definitions and radiology reports, provide a holistic snapshot of knowledge. DeViDe incorporates three key features for knowledge-augmented vision language alignment: First, a large-language model-based augmentation is employed to homogenise medical knowledge from diverse sources. Second, this knowledge is aligned with image information at various levels of granularity. Third, a novel projection layer is proposed to handle the complexity of aligning each image with multiple descriptions arising in a multi-label setting. In zero-shot settings, DeViDe performs comparably to fully supervised models on external datasets and achieves state-of-the-art results on three large-scale datasets. Additionally, fine-tuning DeViDe on four downstream tasks and six segmentation tasks showcases its superior performance across data from diverse distributions.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the difficulty of encoding medical knowledge effectively in medical vision-language pre-training. Existing methods that pre-train on paired radiographs and radiology reports often struggle to capture this knowledge: radiology reports give insight only into the current disease manifestation, while the medical definitions used by contemporary methods are too abstract, which leaves a gap in knowledge. To close this gap, the paper proposes DeViDe (Definitions and Visual Descriptions), a new transformer-based method that adds radiographic descriptions collected from the open web. These descriptions outline the general visual characteristics of diseases in radiographs and, combined with abstract definitions and radiology reports, provide a holistic snapshot of knowledge. DeViDe achieves knowledge-augmented vision-language alignment through three key features. First, a large language model-based augmentation homogenizes medical knowledge from different sources. Second, this knowledge is aligned with image information at several levels of granularity. Third, a new projection layer handles the complexity of aligning each image with multiple descriptions in a multi-label setting. In zero-shot settings, DeViDe performs comparably to fully supervised models on external datasets and achieves state-of-the-art results on three large-scale datasets. Fine-tuning on four downstream tasks and six segmentation tasks further demonstrates its superior performance on data from diverse distributions.
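The multi-label setting mentioned above, where one image must be matched against several descriptions at once, can be illustrated with a small sketch. This is not DeViDe's projection layer, whose design is not given in this summary. It is a generic illustration that assumes each image and each description has already been encoded as a vector. The function names (`cosine`, `multilabel_scores`, `multilabel_bce`), the `temperature` value, and the toy embeddings are all hypothetical. The point it shows is that each description gets an independent sigmoid score rather than competing in a softmax, so several findings can be positive for the same image.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multilabel_scores(image_emb, description_embs, temperature=0.1):
    """One independent probability per description (sigmoid, not softmax).

    The temperature is an illustrative assumption, not a value from the paper.
    """
    return [
        1.0 / (1.0 + math.exp(-cosine(image_emb, d) / temperature))
        for d in description_embs
    ]


def multilabel_bce(scores, labels, eps=1e-7):
    """Mean binary cross-entropy over all descriptions for one image."""
    total = 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Toy example: one image embedding and three description embeddings.
image_emb = [1.0, 0.0]
description_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

scores = multilabel_scores(image_emb, description_embs)
loss_matching = multilabel_bce(scores, [1, 0, 0])     # labels agree with similarity
loss_mismatched = multilabel_bce(scores, [0, 0, 1])   # labels contradict similarity
```

In this toy setup the loss is lower when the positive label belongs to the description most similar to the image, which is the behaviour an alignment objective of this kind would reward during pre-training.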

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis

Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology

K-Diag: Knowledge-enhanced Disease Diagnosis in Radiographic Imaging

Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity

MG-3D: Multi-Grained Knowledge-Enhanced 3D Medical Vision-Language Pre-training

Deep neural models for automated multi-task diagnostic scan management—quality enhancement, view classification and report generation

MedDr: Diagnosis-Guided Bootstrapping for Large-Scale Medical Vision-Language Learning

A vision-language model with multi-granular knowledge fusion in medical imaging

Medical Vision-Language Pre-Training for Brain Abnormalities

Knowledge-enhanced visual-language pre-training on chest radiology images

Unified Medical Image Pre-training in Language-Guided Common Semantic Space

IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training

Visual–language Foundation Models in Medicine

Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains

Grounded Knowledge-Enhanced Medical VLP for Chest X-Ray

Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-Training