Rethinking Model Prototyping through the MedMNIST+ Dataset Collection

Sebastian Doerrich, Francesco Di Salvo, Julius Brockmann, Christian Ledig
2024-05-08
Abstract: The integration of deep learning based systems in clinical practice is often impeded by challenges rooted in limited and heterogeneous medical datasets. In addition, prioritization of marginal performance improvements on a few, narrowly scoped benchmarks over clinical applicability has slowed down meaningful algorithmic progress. This trend often results in excessive fine-tuning of existing methods to achieve state-of-the-art performance on selected datasets rather than fostering clinically relevant innovations. In response, this work presents a comprehensive benchmark for the MedMNIST+ database to diversify the evaluation landscape and conduct a thorough analysis of common convolutional neural networks (CNNs) and Transformer-based architectures for medical image classification. Our evaluation encompasses various medical datasets, training methodologies, and input resolutions, aiming to reassess the strengths and limitations of widely used model variants. Our findings suggest that computationally efficient training schemes and modern foundation models hold promise in bridging the gap between expensive end-to-end training and more resource-refined approaches. Additionally, contrary to prevailing assumptions, we observe that higher resolutions may not consistently improve performance beyond a certain threshold, advocating for the use of lower resolutions, particularly in prototyping stages, to expedite processing. Notably, our analysis reaffirms the competitiveness of convolutional models compared to ViT-based architectures, emphasizing the importance of comprehending the intrinsic capabilities of different model architectures. Moreover, we hope that our standardized evaluation framework will help enhance transparency, reproducibility, and comparability on the MedMNIST+ dataset collection as well as future research within the field. Code is available at
Image and Video Processing, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses obstacles in developing deep learning models for medical imaging by proposing an evaluation benchmark and a series of experimental analyses. The aim is to inform model prototyping, training strategy, and the choice of input resolution. Specifically, the paper targets the following key problems:

1. **Limited and heterogeneous medical datasets**: The adoption of deep learning systems in clinical practice is held back largely by the small size and diverse sources of the available datasets, which makes it hard for supervised learning algorithms to generalize.
2. **Overemphasis on marginal performance improvements on benchmarks**: Researchers tend to fine-tune existing methods to reach state-of-the-art results on a few narrowly scoped benchmarks while neglecting clinical applicability. This trend slows meaningful algorithmic progress.
3. **Choice of models and training schemes**: The paper re-evaluates common convolutional neural networks (CNNs) and Transformer-based architectures on medical image classification tasks and examines the effect of different training schemes (such as end-to-end training and linear probing).
4. **Impact of input resolution**: The paper also examines how input resolution affects model performance, in particular whether lower resolutions are sufficient during prototyping, where they speed up processing.

Based on these analyses, the main objectives of the paper are:

- Providing a comprehensive benchmarking framework that covers various medical datasets, training methods, and input resolutions, to promote a deeper understanding of the strengths and limitations of commonly used models.
- Re-examining common assumptions about model design, training strategies, and input resolution requirements.
- Recommending best practices for model development and deployment to enhance transparency, reproducibility, and comparability.

In summary, the paper aims to guide the development of deep learning models in the medical field by presenting a comprehensive benchmark for the MedMNIST+ dataset collection together with detailed experimental results.
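
To make the distinction between training schemes more concrete, the following is a minimal pure-Python sketch of linear probing: a fixed function stands in for a frozen, pretrained backbone, and only a logistic-regression head is trained on its outputs. In end-to-end training, by contrast, the backbone parameters would be updated as well. The toy data, the names `frozen_features`, `train_linear_probe`, and `predict`, and all hyperparameters are hypothetical illustrations and are not taken from the paper or its code.

```python
import math
import random


def frozen_features(x):
    """Hypothetical stand-in for a frozen backbone: a fixed, non-trainable mapping."""
    a, b = x
    return [a, b, a * b, 1.0]  # the constant 1.0 acts as a bias feature


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(samples, labels, epochs=200, lr=0.5):
    """Fit only the linear head (logistic regression) on top of frozen features."""
    # Features are computed once because the backbone is never updated.
    feats = [frozen_features(x) for x in samples]
    w = [0.0] * len(feats[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for f, y in zip(feats, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)))
            for i, fi in enumerate(f):
                grad[i] += (p - y) * fi  # gradient of the cross-entropy loss
        # Full-batch gradient descent step on the head weights only.
        w = [wi - lr * g / len(feats) for wi, g in zip(w, grad)]
    return w


def predict(w, x):
    """Return the predicted class (0 or 1) for one raw input."""
    score = sum(wi * fi for wi, fi in zip(w, frozen_features(x)))
    return int(sigmoid(score) >= 0.5)


# Toy binary task (assumption): the label is 1 when both coordinates share a sign.
rng = random.Random(0)
samples = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
labels = [int(a * b > 0) for a, b in samples]

w = train_linear_probe(samples, labels)
accuracy = sum(predict(w, x) == y for x, y in zip(samples, labels)) / len(samples)
print(f"linear-probe training accuracy: {accuracy:.2f}")
```

Because the features are computed once and only a handful of head weights are optimized, this scheme is far cheaper than updating the whole network, which is the trade-off the summary refers to when it contrasts end-to-end training with linear probing.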