BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays

Yang Zhou, Tan Li Hui Faith, Yanyu Xu, Sicong Leng, Xinxing Xu, Yong Liu, Rick Siow Mong Goh
2024-10-29
Abstract: Medical Vision-Language Pretraining (MedVLP) shows promise in learning generalizable and transferable visual representations from paired and unpaired medical images and reports. MedVLP can provide useful features to downstream tasks and facilitate adapting task-specific models to new setups using fewer examples. However, existing MedVLP methods often differ in terms of datasets, preprocessing, and finetuning implementations. This poses great challenges in evaluating how well a MedVLP method generalizes to various clinically relevant tasks, due to the lack of a unified, standardized, and comprehensive benchmark. To fill this gap, we propose BenchX, a unified benchmark framework that enables head-to-head comparison and systematic analysis of MedVLP methods using public chest X-ray datasets. Specifically, BenchX is composed of three components: 1) comprehensive datasets covering nine datasets and four medical tasks; 2) benchmark suites to standardize data preprocessing, train-test splits, and parameter selection; 3) unified finetuning protocols that accommodate heterogeneous MedVLP methods for consistent task adaptation in classification, segmentation, and report generation. Utilizing BenchX, we establish baselines for nine state-of-the-art MedVLP methods and find that the performance of some early MedVLP methods can be enhanced to surpass more recent ones, prompting a revisiting of the developments and conclusions from prior works in MedVLP. Our code is available at https://github.com/yangzhou12/BenchX.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the lack of a unified, standardized, and comprehensive benchmark for medical vision-language pretraining (MedVLP) methods, which makes it difficult to compare different methods fairly and systematically. Existing MedVLP methods differ in their datasets, preprocessing, and fine-tuning implementations, so it is hard to evaluate how well each method generalizes across clinically relevant tasks. To fill this gap, the authors propose BenchX, a unified benchmark framework designed to enable head-to-head comparison and systematic analysis of MedVLP methods using public chest X-ray datasets. The BenchX framework includes three main components:

1. **Comprehensive Datasets**: Covering 9 datasets and 4 medical tasks.
2. **Benchmark Suite**: Standardized data preprocessing, train-test splits, and parameter selection, reducing the impact of inconsistent experimental setups on MedVLP performance.
3. **Unified Fine-tuning Protocol**: Accommodating heterogeneous MedVLP methods so that task adaptation is consistent across classification, segmentation, and report generation.

Using BenchX, the authors establish baselines for 9 state-of-the-art MedVLP methods and find that the performance of some early MedVLP methods can be enhanced to surpass more recent ones. This suggests the need to revisit the developments and conclusions of prior work in the MedVLP field.
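
The role of standardized train-test splits can be illustrated with a minimal sketch. This is not BenchX's actual code: the function name `fixed_split`, the seed, the test fraction, and the sample IDs are all hypothetical, chosen only to show how a deterministic split gives every method under comparison the same training and test data regardless of the order in which samples are loaded.

```python
import random


def fixed_split(sample_ids, test_fraction=0.2, seed=0):
    """Deterministically split sample IDs into (train, test) lists.

    Hypothetical helper for illustration; not part of BenchX.
    """
    ids = sorted(sample_ids)       # canonical order, independent of input order
    rng = random.Random(seed)      # local RNG, so global random state is untouched
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# Hypothetical image IDs, loaded in two different orders.
ids = [f"img_{i:03d}" for i in range(10)]
train_a, test_a = fixed_split(ids)
train_b, test_b = fixed_split(list(reversed(ids)))

# Both calls produce the same partition, so results are comparable.
print(train_a == train_b and test_a == test_b)
```

Sorting before shuffling is the design choice that matters here: without it, two pipelines that read the same files in a different order would end up with different test sets even under the same seed.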