LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation

Shuyi Ouyang,Jinyang Zhang,Xiangye Lin,Xilai Wang,Qingqing Chen,Yen-Wei Chen,Lanfen Lin
2024-09-03
Abstract: Conventional medical image segmentation methods have been found inadequate in helping physicians identify specific lesions for diagnosis and treatment. Given the utility of text as an instructional format, we introduce a novel task termed Medical Image Referring Segmentation (MIRS), which requires segmenting specified lesions in images based on the given language expressions. Due to the varying object scales in medical images, MIRS demands robust vision-language modeling and comprehensive multi-scale interaction for precise localization and segmentation under linguistic guidance. However, existing medical image segmentation methods fall short of meeting these demands, resulting in insufficient segmentation accuracy. In response, we propose an approach named Language-guided Scale-aware MedSegmentor (LSMS), incorporating two appealing designs: (1) a Scale-aware Vision-Language Attention module that leverages diverse convolutional kernels to acquire rich visual knowledge and interact closely with linguistic features, thereby enhancing lesion localization capability; (2) a Full-Scale Decoder that globally models multi-modal features across various scales, capturing complementary information between scales to accurately outline lesion boundaries. Addressing the lack of suitable datasets for MIRS, we constructed a vision-language medical dataset called Reference Hepatic Lesion Segmentation (RefHL-Seg). This dataset comprises 2,283 abdominal CT slices from 231 cases, with corresponding textual annotations and segmentation masks for various liver lesions in images. We validated the performance of LSMS for MIRS and conventional medical image segmentation tasks across various datasets. Our LSMS consistently outperforms existing methods on all datasets with lower computational costs. The code and datasets will be released.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the following problem: traditional medical image segmentation methods do not sufficiently help doctors identify specific lesions for diagnosis and treatment. In particular, they struggle to precisely locate and segment a specified lesion area from a given language expression. The authors therefore introduce a new task, Medical Image Referring Segmentation (MIRS), which requires segmenting specific lesions in medical images based on the given language expressions.

### Main Problems and Challenges

1. **Robust Vision-Language Modeling**:
   - The size and shape of objects in medical images vary greatly, which makes it hard to integrate visual and language features effectively.
   - Fusing single-scale visual features with language features may ignore rich local visual information relevant to the language guidance, which degrades the model's ability to locate the target.
2. **Comprehensive Multi-Scale Interaction**:
   - Globally modeling the complex differences between scales helps extract valuable global vision-language features.
   - Without complementary information across scales, the complex visual environment of medical images can leave lesion boundaries incompletely identified during segmentation.

### Solutions

To address these problems, the authors propose a new model named Language-guided Scale-aware MedSegmentor (LSMS), which contains two key designs:

1. **Scale-aware Vision-Language Attention (SVLA) Module**:
   - Uses convolution kernels of different sizes to acquire rich visual knowledge and interact closely with language features, which enhances lesion localization.
   - Processes visual knowledge from different receptive fields and fuses it deeply with language features, improving vision-language consistency.
2. **Full-Scale Decoder (FSD)**:
   - Globally models multi-modal features, aligning and integrating multi-modal feature maps across multiple scales to improve the understanding of detail in complex medical visual environments.
   - Improves the prediction of lesion boundaries by aligning and integrating feature maps of different scales.

In addition, to support research on the MIRS task, the authors constructed a dataset named Reference Hepatic Lesion Segmentation (RefHL-Seg). It contains 2,283 abdominal CT slices from 231 cases, covering various liver lesions, with corresponding text annotations and segmentation masks.

### Summary

The main contributions of this paper are:

1. Introducing the MIRS task, which aims to locate and segment target objects in medical images according to referring expressions.
2. Proposing the LSMS model, which improves lesion localization and segmentation accuracy through the SVLA module and the FSD.
3. Constructing the RefHL-Seg dataset for training and validating models on the MIRS task.
4. Conducting extensive experiments on multiple datasets, which show that LSMS outperforms existing methods at a lower computational cost.

Together, these contributions aim to improve the accuracy and practicality of medical image segmentation and to support doctors' diagnostic work in clinical practice.
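
To make the two designs described under Solutions more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation, and every function name in it (`box_filter`, `scale_aware_attention`, `upsample_nearest`, `full_scale_fuse`) is hypothetical. Several simplifications are assumed: fixed mean filters stand in for learned convolution kernels of different sizes, a single-head dot-product attention over word embeddings stands in for the SVLA fusion, and nearest-neighbor upsampling followed by averaging stands in for the global multi-scale modeling in the FSD.

```python
import math

def box_filter(image, k):
    """Mean filter with a k x k window (zero padding); a stand-in for a learned conv kernel."""
    h, w, r = len(image), len(image[0]), k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for yy in range(max(0, y - r), min(h, y + r + 1)):
                for xx in range(max(0, x - r), min(w, x + r + 1)):
                    total += image[yy][xx]
            out[y][x] = total / (k * k)
    return out

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def scale_aware_attention(image, word_embs, kernel_sizes=(1, 3, 5)):
    """Per pixel: stack multi-kernel responses, attend over word embeddings, add a residual."""
    maps = [box_filter(image, k) for k in kernel_sizes]
    d = len(kernel_sizes)  # assumption: each word embedding also has length d
    fused = []
    for y in range(len(image)):
        row = []
        for x in range(len(image[0])):
            q = [m[y][x] for m in maps]  # one response per receptive field
            scores = [sum(a * b for a, b in zip(q, w)) / math.sqrt(d) for w in word_embs]
            weights = softmax(scores)
            ctx = [sum(wt * w[i] for wt, w in zip(weights, word_embs)) for i in range(d)]
            row.append([a + b for a, b in zip(q, ctx)])  # visual feature + language context
        fused.append(row)
    return fused

def upsample_nearest(fmap, factor):
    """Repeat every value `factor` times along both axes."""
    return [[v for v in row for _ in range(factor)] for row in fmap for _ in range(factor)]

def full_scale_fuse(maps_with_factors):
    """Align each (map, upsampling factor) pair to the finest grid, then average element-wise."""
    aligned = [upsample_nearest(m, f) for m, f in maps_with_factors]
    h, w = len(aligned[0]), len(aligned[0][0])
    return [[sum(a[y][x] for a in aligned) / len(aligned) for x in range(w)] for y in range(h)]

# Toy inputs: a 5x5 "slice" with a bright 3x3 lesion and two made-up word embeddings.
image = [[1.0 if 1 <= y <= 3 and 1 <= x <= 3 else 0.0 for x in range(5)] for y in range(5)]
word_embs = [[1.0, 0.5, 0.2], [0.0, 0.3, 0.9]]
fused = scale_aware_attention(image, word_embs)

# Fuse a fine 4x4 map with a coarse 2x2 map that must be upsampled by 2.
fine = [[1.0] * 4 for _ in range(4)]
coarse = [[0.0, 2.0], [4.0, 6.0]]
merged = full_scale_fuse([(fine, 1), (coarse, 2)])
```

In this sketch, each pixel's query vector holds one response per kernel size, so the language attention sees small and large receptive fields at once. A real model would use learned convolutions, projected query/key/value features, and a far richer decoder than simple averaging.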