TG-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model

Yihao Zhao, Enhao Zhong, Cuiyun Yuan, Yang Li, Man Zhao, Chunxia Li, Jun Hu, Chenbin Liu
2024-09-05
Abstract: We propose TG-LMM (Text-Guided Large Multi-Modal Model), a novel approach that leverages textual descriptions of organs to enhance segmentation accuracy in medical images. Existing medical image segmentation methods face several challenges: current medical automatic segmentation models do not effectively utilize prior knowledge, such as descriptions of organ locations; previous text-visual models focus on identifying the target rather than improving the segmentation accuracy; prior models attempt to use prior knowledge to enhance accuracy but do not incorporate pre-trained models. To address these issues, TG-LMM integrates prior knowledge, specifically expert descriptions of the spatial locations of organs, into the segmentation process. Our model utilizes pre-trained image and text encoders to reduce the number of training parameters and accelerate the training process. Additionally, we designed a comprehensive image-text information fusion structure to ensure thorough integration of the two modalities of data. We evaluated TG-LMM on three authoritative medical image datasets, encompassing the segmentation of various parts of the human body. Our method demonstrated superior performance compared to existing approaches, such as MedSAM, SAM and nnUnet.
Computer Vision and Pattern Recognition, Medical Physics
What problem does this paper attempt to address?
The paper targets several shortcomings of existing medical image segmentation methods and proposes a new method, the Text-Guided Large Multi-Modal Model (TG-LMM), to improve segmentation accuracy. Specifically, it identifies three challenges:
1. Current automatic segmentation models fail to make effective use of prior knowledge, such as descriptions of organ locations.
2. Previous text-visual models focus on identifying the target rather than on improving segmentation accuracy.
3. Models that do use prior knowledge to enhance accuracy do not incorporate pre-trained models.
To address these issues, TG-LMM integrates expert descriptions of the spatial locations of organs into the segmentation process as prior knowledge. It uses pre-trained image and text encoders to reduce the number of training parameters and accelerate training. It also introduces an image-text information fusion structure designed to integrate the two modalities thoroughly. According to the abstract, TG-LMM outperforms existing approaches such as MedSAM, SAM, and nnUnet on three authoritative medical image datasets covering various parts of the human body.
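The idea of fusing text features into image features can be illustrated with a small sketch. The summary above does not describe the actual fusion structure, so the code below is only an assumption: a single-head cross-attention step without learned projections, in which each image patch feature attends to the text token features and the attended text is added back to the patch. The function name `cross_attention_fuse` and the toy feature values are hypothetical stand-ins for the outputs of the pre-trained image and text encoders, not part of TG-LMM.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attention_fuse(image_feats, text_feats):
    """Fuse text features into image features with one cross-attention step.

    image_feats: list of patch vectors (queries), each of length d.
    text_feats:  list of token vectors (keys and values), each of length d.
    Returns one fused vector per patch: patch + attention-weighted text.

    Assumption: single head, no learned projections. This is an
    illustrative simplification, not the fusion structure of TG-LMM.
    """
    d = len(image_feats[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for patch in image_feats:
        # Attention weights of this patch over all text tokens.
        weights = softmax([dot(patch, tok) * scale for tok in text_feats])
        # Weighted sum of the text token vectors.
        attended = [
            sum(w * tok[i] for w, tok in zip(weights, text_feats))
            for i in range(d)
        ]
        # Residual connection keeps the original image information.
        fused.append([p + a for p, a in zip(patch, attended)])
    return fused


# Made-up numbers standing in for frozen encoder outputs.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_feats = [[2.0, 0.0], [0.0, 2.0]]
fused = cross_attention_fuse(image_feats, text_feats)
```

In this toy setup, a patch that aligns with one text token draws most of its added signal from that token, while a patch equally similar to both tokens receives their average. In a real model the encoders would typically be kept frozen and only the fusion and decoding layers trained, which is one way to reduce the number of trainable parameters as the abstract describes.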