Increasing SAM Zero-Shot Performance on Multimodal Medical Images Using GPT-4 Generated Descriptive Prompts Without Human Annotation

Zekun Jiang, Dongjie Cheng, Ziyuan Qin, Jun Gao, Qicheng Lao, Kang Li, Le Zhang
DOI: https://doi.org/10.48550/arXiv.2402.15759
2024-02-24
Computer Vision and Pattern Recognition
Abstract: This study develops and evaluates a novel multimodal medical image zero-shot segmentation algorithm named Text-Visual-Prompt SAM (TV-SAM) that requires no manual annotations. TV-SAM integrates the large language model GPT-4, the vision-language model GLIP, and the Segment Anything Model (SAM) to autonomously generate descriptive text prompts and visual bounding box prompts from medical images, thereby enhancing SAM for zero-shot segmentation. Comprehensive evaluations are conducted on seven public datasets encompassing eight imaging modalities to demonstrate that TV-SAM can effectively segment unseen targets across various modalities without additional training, significantly outperforming SAM AUTO and GSAM, closely matching the performance of SAM BBOX with gold standard bounding box prompts, and surpassing the state-of-the-art on specific datasets like ISIC and WBC. The study indicates that TV-SAM serves as an effective multimodal medical image zero-shot segmentation algorithm, highlighting the significant contribution of GPT-4 to zero-shot segmentation. Integrating foundational models such as GPT-4, GLIP, and SAM could enhance the capability to address complex problems in specialized domains. The code is available at: https://github.com/JZK00/TV-SAM.
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the issue of zero-shot segmentation in multimodal medical images. Specifically:

1. **Proposing a New Zero-Shot Segmentation Algorithm**: The research team developed a new algorithm called Text-Visual-Prompt SAM (TV-SAM). This algorithm combines the large language model GPT-4, the vision-language model GLIP, and the Segment Anything Model (SAM) to achieve automatic text and visual prompt generation without manual annotation.
2. **Enhancing SAM's Performance in Zero-Shot Segmentation**: By integrating GPT-4 to automatically generate descriptive text prompts and visual bounding box prompts, the performance of SAM in zero-shot segmentation tasks is enhanced.
3. **Validating the Algorithm's Effectiveness**: The study conducted extensive evaluations on 7 public datasets, covering 8 different medical imaging modalities. It demonstrated that TV-SAM can effectively segment unseen targets and outperformed current state-of-the-art methods on specific datasets such as ISIC and WBC.

In summary, this paper attempts to solve the challenge of zero-shot segmentation in multimodal medical images by combining multiple foundational models, thereby improving segmentation accuracy and efficiency.
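
The three-stage flow described above (a language model produces a descriptive text prompt, a grounding model turns it into bounding boxes, and a segmenter turns each box into a mask) can be illustrated with a minimal sketch. This is not the authors' implementation and does not call GPT-4, GLIP, or SAM. The class `ZeroShotPipeline` and the functions `toy_describe`, `toy_ground`, and `toy_segment` are hypothetical stand-ins that use simple thresholding on a toy image, so that only the data flow between stages is shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]          # toy grayscale image as nested lists
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive
Mask = List[List[bool]]


@dataclass
class ZeroShotPipeline:
    """Hypothetical wiring of the three stages; each stage is a plain callable."""
    describe: Callable[[Image, str], str]          # stand-in for the language model
    ground: Callable[[Image, str], Sequence[Box]]  # stand-in for the grounding model
    segment: Callable[[Image, Box], Mask]          # stand-in for the box-prompted segmenter

    def run(self, image: Image, target: str) -> List[Mask]:
        text_prompt = self.describe(image, target)   # stage 1: text prompt
        boxes = self.ground(image, text_prompt)      # stage 2: visual (box) prompts
        return [self.segment(image, box) for box in boxes]  # stage 3: one mask per box


def toy_describe(image: Image, target: str) -> str:
    # Assumption: a fixed template replaces a generated description.
    return f"a bright, roughly compact {target}"


def toy_ground(image: Image, text_prompt: str, threshold: int = 128) -> List[Box]:
    # Assumption: the "grounded" region is the bounding box of all bright pixels.
    coords = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value >= threshold]
    if not coords:
        return []
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return [(min(xs), min(ys), max(xs) + 1, max(ys) + 1)]


def toy_segment(image: Image, box: Box, threshold: int = 128) -> Mask:
    # Assumption: the mask is the set of bright pixels inside the box.
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 and value >= threshold
             for x, value in enumerate(row)]
            for y, row in enumerate(image)]


if __name__ == "__main__":
    image = [
        [0,   0,   0, 0],
        [0, 200, 210, 0],
        [0, 220, 230, 0],
        [0,   0,   0, 0],
    ]
    pipeline = ZeroShotPipeline(toy_describe, toy_ground, toy_segment)
    masks = pipeline.run(image, "lesion")
    print(len(masks), sum(v for row in masks[0] for v in row))
```

Keeping each stage behind a plain callable means that a real model could, in principle, be substituted for any single stage without changing the others. The actual interfaces are in the repository linked in the abstract.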