nnU-Net Revisited: A Call for Rigorous Validation in 3D Medical Image Segmentation

Fabian Isensee, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus Maier-Hein, Paul F. Jaeger
2024-07-25
Abstract: The release of nnU-Net marked a paradigm shift in 3D medical image segmentation, demonstrating that a properly configured U-Net architecture could still achieve state-of-the-art results. Despite this, the pursuit of novel architectures, and the respective claims of superior performance over the U-Net baseline, continued. In this study, we demonstrate that many of these recent claims fail to hold up when scrutinized for common validation shortcomings, such as the use of inadequate baselines, insufficient datasets, and neglected computational resources. By meticulously avoiding these pitfalls, we conduct a thorough and comprehensive benchmarking of current segmentation methods including CNN-based, Transformer-based, and Mamba-based approaches. In contrast to current beliefs, we find that the recipe for state-of-the-art performance is 1) employing CNN-based U-Net models, including ResNet and ConvNeXt variants, 2) using the nnU-Net framework, and 3) scaling models to modern hardware resources. These results indicate an ongoing innovation bias towards novel architectures in the field and underscore the need for more stringent validation standards in the quest for scientific progress.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a recurring problem in 3D medical image segmentation: many newly proposed models and methods claim to surpass the nnU-Net baseline, but these claims often lack rigorous validation. Specifically, the authors point out the following major issues:

1. **Insufficient Benchmarking**: Many new methods are validated on datasets that are too few in number or too low in quality to support a comprehensive evaluation of their performance.
2. **Unfair Comparisons**: Some studies combine their innovations with additional performance-enhancing techniques (such as residual connections or self-supervised pre-training), making it difficult to compare the results fairly with the baseline model.
3. **Hardware Resource Differences**: Some studies run their experiments under different hardware conditions, so the results are not directly comparable.
4. **Lack of Standardized Benchmarks**: Many studies do not compare against rigorously configured baseline models, which casts doubt on the reliability of their results.

To address these issues, the authors propose a set of systematic validation standards and re-evaluate currently popular 3D medical image segmentation methods through large-scale benchmarking. Their goal is to promote more rigorous method validation in the field and thereby foster genuine scientific progress.
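As a rough illustration of the first and last points, the sketch below checks whether a claimed improvement over a baseline holds across several datasets rather than on a single one. It is not taken from the paper: the function name `compare_to_baseline`, the score layout, and the `min_datasets` threshold are all assumptions made for this example, and any scores passed in are expected to come from runs under matched conditions.

```python
from statistics import mean


def compare_to_baseline(scores, baseline="baseline", min_datasets=3):
    """Summarize how each method fares against a baseline, per dataset.

    scores: {method_name: {dataset_name: score}}, where higher is better.
    min_datasets: hypothetical threshold below which a comparison is
        flagged as resting on too few datasets (an assumption, not a
        value taken from the paper).
    """
    base = scores[baseline]
    report = {}
    for method, per_dataset in scores.items():
        if method == baseline:
            continue
        # Only compare on datasets where both methods were evaluated.
        shared = sorted(set(base) & set(per_dataset))
        gains = [per_dataset[d] - base[d] for d in shared]
        report[method] = {
            "datasets_compared": len(shared),
            "wins": sum(g > 0 for g in gains),
            "mean_gain": mean(gains) if gains else None,
            "enough_datasets": len(shared) >= min_datasets,
        }
    return report


# Example call with placeholder values (not real results):
# compare_to_baseline({
#     "baseline":  {"A": 0.80, "B": 0.70, "C": 0.90},
#     "candidate": {"A": 0.82, "B": 0.69},
# })
```

A method that wins on one dataset but was never run on the others would show up here with `enough_datasets` set to `False`, which is the kind of gap the summary above describes.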