Necessity and Impact of Specialization of Large Foundation Model for Medical Segmentation Tasks

Eric Nguyen, Hengjie Liu, Dan Ruan
DOI: https://doi.org/10.1101/2024.06.02.597036
2024-06-03
Abstract:
Background: Large foundation models, such as the Segment Anything Model (SAM), have shown remarkable performance in image segmentation tasks. However, how best to make these models truly useful for domain-specific applications, such as medical image segmentation, remains an open question. Recent work has released MedSAM, a medical version of the foundation model trained on a large volume of medical data, which promises state-of-the-art medical segmentation. Independent inspection and dissection by the community are needed.
Purpose: This study assesses the performance of the off-the-shelf medical foundation model MedSAM for the segmentation of anatomical structures in pelvic MR images. We also evaluate its dependency on the prompting scheme and demonstrate the gain from further specialized fine-tuning.
Methods: MedSAM and its lightweight version LiteMedSAM were evaluated out-of-the-box on a public MR dataset consisting of 589 pelvic images split 80:20 for training and testing. An nnU-Net model was trained from scratch to serve as a benchmark and to provide bounding box prompts for MedSAM. MedSAM was evaluated using bounding boxes of differing quality: those derived from ground truth labels, those derived from nnU-Net predictions, and each of these two with a 5-pixel isometric expansion. Lastly, LiteMedSAM was fine-tuned on the training set and reevaluated on this task.
Results: Out-of-the-box MedSAM and LiteMedSAM both performed poorly across the structure set, especially for disjoint or non-convex structures. Varying the prompt across the different bounding box inputs had minimal effect. For the obturator internus, the mean Dice scores for MedSAM and LiteMedSAM were 0.251 +/- 0.110 and 0.101 +/- 0.079, and the mean Hausdorff distances were 34.142 +/- 5.196 mm and 33.688 +/- 5.306 mm, respectively. Fine-tuning LiteMedSAM led to a significant performance gain, improving the Dice score and Hausdorff distance for the obturator internus to 0.864 +/- 0.123 and 5.022 +/- 10.684 mm, on par with nnU-Net, with no significant difference for most structures. All structures benefited significantly from specialized refinement, with varying margins of improvement.
Conclusion: Our study points to the potential of deep learning models like MedSAM and LiteMedSAM for medical segmentation but also highlights the need for specialized refinement and adjudication: off-the-shelf use of such large foundation models is likely to be suboptimal, and specialized fine-tuning can significantly enhance segmentation accuracy.
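To illustrate the bounding box prompting and the two metrics named in the abstract, here is a minimal Python sketch. It is not the authors' code. It assumes 2D binary masks stored as NumPy arrays. The function names (bbox_from_mask, dice_score, hausdorff_distance), the clipping of the expanded box to the image bounds, the (x_min, y_min, x_max, y_max) box order, and the use of all foreground pixels (rather than contour points) for the Hausdorff distance are all assumptions made for this example.

```python
import numpy as np


def bbox_from_mask(mask, margin=0):
    """Return (x_min, y_min, x_max, y_max) enclosing the foreground of a 2D mask.

    margin expands the box by that many pixels on every side, clipped to the
    image bounds (the clipping is an assumption of this sketch).
    """
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("mask is empty; no bounding box can be derived")
    height, width = mask.shape
    x_min = max(int(cols.min()) - margin, 0)
    y_min = max(int(rows.min()) - margin, 0)
    x_max = min(int(cols.max()) + margin, width - 1)
    y_max = min(int(rows.max()) + margin, height - 1)
    return x_min, y_min, x_max, y_max


def dice_score(pred, truth):
    """Dice overlap 2|A n B| / (|A| + |B|) between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = int(pred.sum()) + int(truth.sum())
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * int(np.logical_and(pred, truth).sum()) / total


def hausdorff_distance(pred, truth, spacing=(1.0, 1.0)):
    """Symmetric Hausdorff distance between foreground pixel sets.

    spacing is the (row, column) pixel size, e.g. in mm. Brute force over all
    pixel pairs, so only suitable for small 2D masks.
    """
    p = np.argwhere(pred) * np.asarray(spacing, dtype=float)
    t = np.argwhere(truth) * np.asarray(spacing, dtype=float)
    if len(p) == 0 or len(t) == 0:
        raise ValueError("both masks need at least one foreground pixel")
    # Pairwise Euclidean distances, shape (len(p), len(t)).
    dist = np.linalg.norm(p[:, None, :] - t[None, :, :], axis=-1)
    # Largest nearest-neighbour distance in each direction.
    return float(max(dist.min(axis=1).max(), dist.min(axis=0).max()))


# Toy example: a 20 x 20 square and a copy shifted 5 pixels to the right.
truth = np.zeros((64, 64), dtype=bool)
truth[20:40, 10:30] = True
pred = np.zeros((64, 64), dtype=bool)
pred[20:40, 15:35] = True

tight_box = bbox_from_mask(truth)               # (10, 20, 29, 39)
expanded_box = bbox_from_mask(truth, margin=5)  # (5, 15, 34, 44)
dice = dice_score(pred, truth)                  # 0.75
hd = hausdorff_distance(pred, truth)            # 5.0 (pixel units)
print(tight_box, expanded_box, dice, hd)
```

In this sketch, a tight box from a ground truth or nnU-Net mask corresponds to margin=0 and the expanded variant to margin=5. Passing the scan's pixel spacing as spacing would give the Hausdorff distance in mm rather than pixels.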
Bioengineering