Abstract: Although deep learning has revolutionized abdominal multi-organ segmentation, models often struggle to generalize because they are trained on small, specific datasets. With the recent emergence of large-scale datasets, two important questions arise: Can models trained on these datasets generalize well to different ones? And whether or not they can, how can their generalizability be further improved? To address these questions, we introduce A-Eval, a benchmark for the cross-dataset Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation. We employ training sets from four large-scale public datasets: FLARE22, AMOS, WORD, and TotalSegmentator, each providing extensive labels for abdominal multi-organ segmentation. For evaluation, we incorporate the validation sets from these datasets along with the training set from the BTCV dataset, forming a robust benchmark comprising five distinct datasets. We evaluate the generalizability of various models using the A-Eval benchmark, with a focus on diverse data usage scenarios: training on individual datasets independently, utilizing unlabeled data via pseudo-labeling, mixing different modalities, and joint training across all available datasets. Additionally, we explore the impact of model size on cross-dataset generalizability. Through these analyses, we underline the importance of effective data usage in enhancing models' generalization capabilities, offering valuable insights for assembling large-scale datasets and improving training strategies. The code and pre-trained models are available at https://github.com/uni-medical/A-Eval.

What problem does this paper attempt to address?

The paper addresses the limited generalization of abdominal multi-organ segmentation models across datasets. The researchers introduce a new benchmark, A-Eval, designed specifically for evaluating cross-dataset abdominal multi-organ segmentation performance. A-Eval uses the training sets of four large public datasets (FLARE22, AMOS, WORD, and TotalSegmentator) for training, and the validation sets of these datasets plus the training set of the BTCV dataset for evaluation, forming a robust benchmark that covers five distinct datasets. The paper focuses on the following points:

1. **Evaluation of Model Generalization Capability**: Using the A-Eval benchmark, the researchers evaluate how well various models generalize across datasets, including models trained independently on each dataset, models that use unlabeled data, models that handle multimodal data, and models trained jointly on multiple datasets.
2. **Impact of Data Usage Scenarios**: The study examines how different data usage scenarios affect generalization, such as training on a single dataset, leveraging unlabeled data, handling multimodal data from CT and MRI, and joint training on all available datasets.
3. **Role of Model Size**: The paper also investigates how model size affects cross-dataset generalization by comparing model variants with different parameter scales.
4. **Experimental Results and Analysis**: Through a series of experiments, the paper reports how models trained on one dataset perform on the others and analyzes how pseudo-labeling, multimodal data usage, and model size affect generalization.

The contribution of the paper lies in proposing a comprehensive evaluation method to assess the cross-dataset generalization capability of abdominal multi-organ segmentation models in a standardized way, providing valuable insights and references for future research.
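
The cross-dataset protocol described above (train on one dataset, evaluate on all five) can be illustrated with a minimal sketch. This is not the authors' code: the Dice similarity coefficient is assumed as the metric because the text does not name one, and `dice_per_organ`, `cross_dataset_matrix`, `predict_fns`, and `eval_cases` are hypothetical names introduced only for illustration.

```python
import numpy as np

# Dataset names come from the abstract; everything else is an assumption.
TRAIN_SETS = ["FLARE22", "AMOS", "WORD", "TotalSegmentator"]
EVAL_SETS = TRAIN_SETS + ["BTCV"]


def dice_per_organ(pred, target, num_classes):
    """Dice score for each foreground label 1..num_classes-1.

    Returns NaN for an organ that is absent from both prediction and target.
    """
    scores = []
    for label in range(1, num_classes):
        p = pred == label
        t = target == label
        denom = p.sum() + t.sum()
        if denom > 0:
            scores.append(2.0 * np.logical_and(p, t).sum() / denom)
        else:
            scores.append(np.nan)
    return np.array(scores)


def cross_dataset_matrix(predict_fns, eval_cases, num_classes):
    """Mean Dice of each training source (rows) on each evaluation set (columns).

    predict_fns: dict mapping a training-set name to a callable image -> label map
                 (hypothetical stand-in for a trained model).
    eval_cases:  dict mapping an evaluation-set name to a list of (image, label) pairs.
    """
    matrix = np.full((len(TRAIN_SETS), len(EVAL_SETS)), np.nan)
    for i, train_name in enumerate(TRAIN_SETS):
        for j, eval_name in enumerate(EVAL_SETS):
            case_scores = [
                np.nanmean(dice_per_organ(predict_fns[train_name](img), lab, num_classes))
                for img, lab in eval_cases[eval_name]
            ]
            matrix[i, j] = np.mean(case_scores)
    return matrix
```

In such a matrix, the entries where the training and evaluation datasets share a name would correspond to in-distribution performance, while the remaining entries (including the whole BTCV column) would reflect cross-dataset generalization.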

A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation
