A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation

Ziyan Huang, Zhongying Deng, Jin Ye, Haoyu Wang, Yanzhou Su, Tianbin Li, Hui Sun, Junlong Cheng, Jianpin Chen, Junjun He, Yun Gu, Shaoting Zhang, Lixu Gu, Yu Qiao
2023-09-08
Abstract: Although deep learning has revolutionized abdominal multi-organ segmentation, models often struggle with generalization because they are trained on small, specific datasets. With the recent emergence of large-scale datasets, some important questions arise: Can models trained on these datasets generalize well to different ones? Whether or not they can, how can their generalizability be further improved? To address these questions, we introduce A-Eval, a benchmark for the cross-dataset Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation. We employ training sets from four large-scale public datasets: FLARE22, AMOS, WORD, and TotalSegmentator, each providing extensive labels for abdominal multi-organ segmentation. For evaluation, we incorporate the validation sets from these datasets along with the training set from the BTCV dataset, forming a robust benchmark comprising five distinct datasets. We evaluate the generalizability of various models using the A-Eval benchmark, with a focus on diverse data usage scenarios: training on individual datasets independently, utilizing unlabeled data via pseudo-labeling, mixing different modalities, and joint training across all available datasets. Additionally, we explore the impact of model size on cross-dataset generalizability. Through these analyses, we underline the importance of effective data usage in enhancing models' generalization capabilities, offering valuable insights for assembling large-scale datasets and improving training strategies. The code and pre-trained models are available at https://github.com/uni-medical/A-Eval.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how well abdominal multi-organ segmentation models generalize across datasets. The authors introduce A-Eval, a benchmark designed specifically for cross-dataset evaluation of abdominal multi-organ segmentation. Models are trained on the training sets of four large public datasets (FLARE22, AMOS, WORD, and TotalSegmentator) and evaluated on the validation sets of those datasets plus the training set of the BTCV dataset, so the benchmark covers five distinct datasets. The paper focuses on the following points:
1. **Evaluation of Model Generalization Capability**: Using the A-Eval benchmark, the authors evaluate how well various models generalize across datasets, including models trained independently on each dataset, models that use unlabeled data, models that handle multimodal data, and models trained jointly on multiple datasets.
2. **Impact of Data Usage Scenarios**: The study explores how different data usage scenarios affect generalization: using a single dataset, leveraging unlabeled data via pseudo-labeling, mixing CT and MRI data, and joint training on all available datasets.
3. **Role of Model Size**: The paper also investigates how model size affects cross-dataset generalization by comparing model variants with different parameter counts.
4. **Experimental Results and Analysis**: Through a series of experiments, the paper reports how models trained on one dataset perform on the others and analyzes how pseudo-labeling, multimodal data usage, and model size affect generalization.
The contribution of the paper lies in proposing a comprehensive evaluation method to assess the cross-dataset generalization capability of abdominal multi-organ segmentation models in a standardized way, providing valuable insights and references for future research.
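The cross-dataset protocol summarized above (one model per training source, each scored on every evaluation set) can be illustrated with a small sketch. This is not the authors' code: the Dice coefficient is assumed here as the overlap metric because the text does not name one, and `dice_score`, `cross_dataset_matrix`, and the model callables are hypothetical names introduced only for illustration.

```python
import numpy as np


def dice_score(pred, target, label):
    """Dice overlap for one organ label; NaN if the organ is absent from both maps.

    Assumption: Dice is used as the metric; the source text does not specify one.
    """
    p = pred == label
    t = target == label
    denom = p.sum() + t.sum()
    if denom == 0:
        return float("nan")
    return 2.0 * np.logical_and(p, t).sum() / denom


def cross_dataset_matrix(models, eval_data, labels):
    """Return a (training source x evaluation set) matrix of mean Dice scores.

    models:    dict, training-source name -> callable(image) -> predicted label map
               (hypothetical stand-ins for trained segmentation models)
    eval_data: dict, evaluation-set name -> list of (image, ground_truth) pairs
    labels:    iterable of integer organ labels to score
    """
    scores = np.full((len(models), len(eval_data)), np.nan)
    for i, predict in enumerate(models.values()):
        for j, cases in enumerate(eval_data.values()):
            per_case = [
                dice_score(predict(img), gt, lab)
                for img, gt in cases
                for lab in labels
            ]
            # Leave the cell as NaN when no organ could be scored at all.
            if per_case and not np.all(np.isnan(per_case)):
                scores[i, j] = np.nanmean(per_case)
    return scores
```

In such a matrix, each row corresponds to a training source and each column to an evaluation set, so comparing a cell against the other cells in its row gives a rough picture of how performance changes when the evaluation data comes from a different dataset than the training data.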