Abstract: Conventional medical image segmentation methods have been found inadequate in helping physicians identify specific lesions for diagnosis and treatment. Given the utility of text as an instructional format, we introduce a novel task termed Medical Image Referring Segmentation (MIRS), which requires segmenting specified lesions in images based on the given language expressions. Due to the varying object scales in medical images, MIRS demands robust vision-language modeling and comprehensive multi-scale interaction for precise localization and segmentation under linguistic guidance. However, existing medical image segmentation methods fall short in meeting these demands, resulting in insufficient segmentation accuracy. In response, we propose an approach named Language-guided Scale-aware MedSegmentor (LSMS), incorporating two appealing designs: (1) a Scale-aware Vision-Language Attention module that leverages diverse convolutional kernels to acquire rich visual knowledge and interact closely with linguistic features, thereby enhancing lesion localization capability; (2) a Full-Scale Decoder that globally models multi-modal features across various scales, capturing complementary information between scales to accurately outline lesion boundaries. Addressing the lack of suitable datasets for MIRS, we constructed a vision-language medical dataset called Reference Hepatic Lesion Segmentation (RefHL-Seg). This dataset comprises 2,283 abdominal CT slices from 231 cases, with corresponding textual annotations and segmentation masks for various liver lesions in images. We validated the performance of LSMS for MIRS and conventional medical image segmentation tasks across various datasets. Our LSMS consistently outperforms existing methods on all datasets with lower computational costs. The code and datasets will be released.

What problem does this paper attempt to address?

The paper addresses the following problem: conventional medical image segmentation methods give physicians too little help in identifying specific lesions for diagnosis and treatment. In particular, they struggle to locate and segment a lesion that is specified by a language expression. The authors therefore introduce a new task, Medical Image Referring Segmentation (MIRS), which requires segmenting specified lesions in medical images based on given language expressions.

### Main Problems and Challenges

1. **Robust Vision-Language Modeling**:
   - Objects in medical images vary greatly in size and shape, which makes it hard to integrate visual and language features effectively.
   - Fusing single-scale visual features with language features may ignore rich local visual information relevant to the language guidance, which degrades the model's target localization.
2. **Comprehensive Multi-Scale Interaction**:
   - Globally modeling the complex differences between scales helps extract valuable global vision-language features.
   - In the complex visual environment of medical images, a lack of complementary information between scales may prevent the model from fully identifying lesion boundaries during segmentation.

### Solutions

To address these problems, the authors propose a model named Language-guided Scale-aware MedSegmentor (LSMS), which contains two key designs:

1. **Scale-aware Vision-Language Attention (SVLA) Module**:
   - It uses convolution kernels of different sizes to acquire rich visual knowledge and interact closely with language features, which enhances lesion localization.
   - It processes visual knowledge from different receptive fields and fuses it deeply with language features, improving vision-language consistency.
2. **Full-Scale Decoder (FSD)**:
   - It globally models multi-modal features, aligning and integrating multi-modal feature maps across multiple scales to better capture details in the complex medical visual environment.
   - By aligning and integrating feature maps of different scales, it predicts lesion boundaries more accurately.

In addition, to support research on the MIRS task, the authors constructed a dataset named Reference Hepatic Lesion Segmentation (RefHL-Seg). It contains 2,283 abdominal CT slices from 231 cases, covering various liver lesions, with corresponding text annotations and segmentation masks.

### Summary

The main contributions of this paper are:

1. Introducing the MIRS task, which aims to locate and segment target objects in medical images according to referring expressions.
2. Proposing the LSMS model, which improves lesion localization and segmentation accuracy through the SVLA module and the FSD.
3. Constructing the RefHL-Seg dataset for training and validating models on the MIRS task.
4. Conducting extensive experiments on multiple datasets, which show that LSMS outperforms existing methods at lower computational cost.

Together, these contributions aim to make medical image segmentation more accurate and practical, and to help physicians identify specific lesions more efficiently in clinical practice.
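The SVLA idea described above (visual features extracted at several receptive fields, each interacting with language features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the kernel sizes `(1, 3, 5)`, the fixed box filter standing in for learned convolution kernels, and the plain dot-product cross-attention are all assumptions made for illustration.

```python
import numpy as np


def box_filter(x, k):
    """Average each channel of x (C, H, W) over a k x k window.

    Assumption: a fixed box filter stands in for a learned k x k
    convolution; k is expected to be odd.
    """
    x = np.asarray(x, dtype=float)
    pad = k // 2
    # Edge padding keeps the output the same spatial size as the input.
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    _, h, w = x.shape
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += xp[:, dy:dy + h, dx:dx + w]
    return out / (k * k)


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def scale_aware_vl_attention(visual, language, kernel_sizes=(1, 3, 5)):
    """Hypothetical multi-scale vision-language fusion.

    visual:   (C, H, W) feature map
    language: (T, C) token features, same channel size C
    returns:  (C, H, W) fused feature map
    """
    c, h, w = visual.shape
    fused = np.zeros((h * w, c))
    for k in kernel_sizes:
        # One branch per receptive field: (C, H, W) -> (H*W, C).
        v = box_filter(visual, k).reshape(c, h * w).T
        # Each pixel attends over the language tokens.
        attn = softmax(v @ language.T / np.sqrt(c), axis=-1)  # (H*W, T)
        # Residual connection plus language-conditioned update.
        fused += v + attn @ language
    fused /= len(kernel_sizes)
    return fused.T.reshape(c, h, w)


rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 6, 6))    # toy feature map
language = rng.normal(size=(4, 8))     # toy sentence with 4 tokens
fused = scale_aware_vl_attention(visual, language)
print(fused.shape)
```

Averaging the branches is only one way to combine scales; a learned fusion layer would be an equally plausible choice, and the paper should be consulted for the actual design.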

LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation

MsVRL: Self-Supervised Multiscale Visual Representation Learning Via Cross-Level Consistency for Medical Image Segmentation

LIMIS: Towards Language-based Interactive Medical Image Segmentation

Many Birds, One Stone: Medical Image Segmentation with Multiple Partially Labeled Datasets

Cross-Modal Conditioned Reconstruction for Language-guided Medical Image Segmentation

MedLSAM: Localize and Segment Anything Model for 3D Medical Images

LViT: Language meets Vision Transformer in Medical Image Segmentation

Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

SEG-LUS: A Novel Ultrasound Segmentation Method for Liver and its Accessory Structures Based on Muti-head Self-Attention

MedLSAM: Localize and Segment Anything Model for 3D CT Images

TG-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model

Bi-VLGM: Bi-Level Class-Severity-Aware Vision-Language Graph Matching for Text Guided Medical Image Segmentation

Specific Instance and Cross-Prompt-Based Robust 3-D Semi-Supervised Medical Image Segmentation

Adaptive Interactive Segmentation for Multimodal Medical Imaging via Selection Engine

PFPs: Prompt-guided Flexible Pathological Segmentation for Diverse Potential Outcomes Using Large Vision and Language Models

Segment as You Wish -- Free-Form Language-Based Segmentation for Medical Images

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

MedSeq: Semantic Segmentation for Medical Image Sequences

Reliable segmentation of multiple lesions from medical images

Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models

Leveraging Task-Specific Knowledge from LLM for Semi-Supervised 3D Medical Image Segmentation