MedFMC: A Real-world Dataset and Benchmark For Foundation Model Adaptation in Medical Image Classification

Dequan Wang, Xiaosong Wang, Lilong Wang, Mengzhang Li, Qian Da, Xiaoqiang Liu, Xiangyu Gao, Jun Shen, Junjun He, Tian Shen, Qi Duan, Jie Zhao, Kang Li, Yu Qiao, Shaoting Zhang
2023-06-16
Abstract: Foundation models, often pre-trained with large-scale data, have achieved paramount success in jump-starting various vision and language applications. Recent advances further enable adapting foundation models in downstream tasks efficiently using only a few training samples, e.g., in-context learning. Yet, the application of such learning paradigms in medical image analysis remains scarce due to the shortage of publicly accessible data and benchmarks. In this paper, we aim at approaches adapting the foundation models for medical image classification and present a novel dataset and benchmark for the evaluation, i.e., examining the overall performance of accommodating the large-scale foundation models downstream on a set of diverse real-world clinical tasks. We collect five sets of medical imaging data from multiple institutes targeting a variety of real-world clinical tasks (22,349 images in total), i.e., thoracic diseases screening in X-rays, pathological lesion tissue screening, lesion detection in endoscopy images, neonatal jaundice evaluation, and diabetic retinopathy grading. Results of multiple baseline methods are demonstrated using the proposed dataset from both accuracy and cost-effective perspectives.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of adapting foundation models to medical image classification and proposes a new dataset and benchmark, MedFMC. Specifically, it focuses on the following aspects:
1. **Generality**: The MedFMC dataset includes 5 medical image classification tasks covering different imaging modalities, so that methods can be evaluated for general performance across varied data modalities and image features.
2. **Rare Disease Classification (Tail Categories)**: Because training samples for rare diseases are limited, the paper considers long-tail classification scenarios, i.e., training with only a small number of samples. Data for these categories is also scarce at test time, so separate evaluation metrics are needed to measure an algorithm's performance on the tail categories.
3. **Prediction Accuracy and Adaptation Efficiency**: Besides prediction accuracy, the paper also evaluates cost-effectiveness when training with fewer samples. Combining accuracy and cost metrics is intended to encourage methods that reduce both the effort needed to obtain high-quality annotations and the demand for computational resources.
The paper collects 5 sets of medical imaging data from multiple institutions, totaling 22,349 images, covering clinical tasks such as thoracic disease screening in X-rays, pathological lesion tissue screening, lesion detection in endoscopy images, neonatal jaundice evaluation, and diabetic retinopathy grading. The benchmark results report the performance of multiple baseline methods on these 5 tasks from both accuracy and cost-effectiveness perspectives.
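To illustrate why tail categories call for a separate metric, the following is a minimal sketch that reports overall accuracy alongside the mean recall on a set of designated tail classes. It is not the benchmark's actual evaluation code: the function names, the class labels (`normal`, `rare`), and the choice of per-class recall as the tail metric are all assumptions made for illustration.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Return {class: recall} for two equal-length lists of labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in total}


def head_tail_report(y_true, y_pred, tail_classes):
    """Report overall accuracy and the mean recall over the tail classes.

    `tail_classes` is a hypothetical, user-supplied set of rare labels.
    """
    recalls = per_class_recall(y_true, y_pred)
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    tail = [recalls[c] for c in tail_classes if c in recalls]
    tail_mean = sum(tail) / len(tail) if tail else None
    return {
        "overall_accuracy": overall,
        "tail_mean_recall": tail_mean,
        "per_class": recalls,
    }


# Toy example with made-up labels: 8 common cases and 2 rare ones.
y_true = ["normal"] * 8 + ["rare"] * 2
y_pred = ["normal"] * 8 + ["normal", "rare"]
report = head_tail_report(y_true, y_pred, tail_classes={"rare"})
print(report["overall_accuracy"])  # 0.9
print(report["tail_mean_recall"])  # 0.5
```

In this toy case the overall accuracy looks high while half of the rare cases are missed, which is the kind of gap a dedicated tail-category metric is meant to expose.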