Abstract: Medical vision-and-language pre-training (Med-VLP) has received considerable attention owing to its applicability to extracting generic vision-and-language representations from medical images and texts. Most existing methods contain three elements: uni-modal encoders (i.e., a vision encoder and a language encoder), a multi-modal fusion module, and pretext tasks. Few studies consider the importance of medical domain expert knowledge or explicitly exploit such knowledge to facilitate Med-VLP. Although knowledge-enhanced vision-and-language pre-training (VLP) methods exist in the general domain, most require off-the-shelf toolkits (e.g., object detectors and scene graph parsers), which are unavailable in the medical domain. In this paper, we propose a systematic and effective approach to enhance Med-VLP with structured medical knowledge from three perspectives. First, since knowledge can be regarded as an intermediate medium between vision and language, we align the representations of the vision encoder and the language encoder through knowledge. Second, we inject knowledge into the multi-modal fusion model so that the model can reason using knowledge as a supplement to the input image and text. Third, we guide the model to emphasize the most critical information in images and texts by designing knowledge-induced pretext tasks. To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on all downstream tasks. Further analyses explore the effects of different components of our approach and various pre-training settings.

KB-VLP: Knowledge Based Vision and Language Pretraining

Retrieval-based Knowledge Augmented Vision Language Pre-training

Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

Structure Pre-training and Prompt Tuning for Knowledge Graph Transfer

Unified Vision-Language Pre-Training for Image Captioning and VQA

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

3D Scene Graph Guided Vision-Language Pre-training

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends

VLP: A Survey on Vision-language Pre-training

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

3D Vision and Language Pretraining with Large-Scale Synthetic Data

Knowledge distilled pre-training model for vision-language-navigation

NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training

VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

ViLTA: Enhancing Vision-Language Pre-training Through Textual Augmentation

Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA