Cross-Modal Conditioned Reconstruction for Language-guided Medical Image Segmentation

Xiaoshuang Huang, Hongxiang Li, Meng Cao, Long Chen, Chenyu You, Dong An

2024-07-08

Abstract: Recent developments underscore the potential of textual information in enhancing learning models for a deeper understanding of medical visual semantics. However, language-guided medical image segmentation still faces a challenging issue. Previous works employ implicit and ambiguous architectures to embed textual information. This leads to segmentation results that are inconsistent with the semantics represented by the language, sometimes even diverging significantly. To this end, we propose a novel cross-modal conditioned Reconstruction for Language-guided Medical Image Segmentation (RecLMIS) to explicitly capture cross-modal interactions, which assumes that well-aligned medical visual features and medical notes can effectively reconstruct each other. We introduce conditioned interaction to adaptively predict patches and words of interest. Subsequently, they are utilized as conditioning factors for mutual reconstruction to align with regions described in the medical notes. Extensive experiments demonstrate the superiority of our RecLMIS, surpassing LViT by 3.74% mIoU on the publicly available MosMedData+ dataset and achieving an average increase of 1.89% mIoU for cross-domain tests on our QATA-CoV19 dataset. Simultaneously, we achieve a relative reduction of 20.2% in parameter count and a 55.5% decrease in computational load. The code will be available at https://github.com/ShashankHuang/RecLMIS.
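
The abstract reports its gains in mIoU. As a reminder of what that metric measures, here is a minimal sketch of one common definition: the intersection over union of each predicted binary mask against its ground truth, averaged over images. The function names `iou` and `mean_iou` are illustrative, and the exact evaluation protocol used in the paper is not stated on this page, so this is an assumption rather than the authors' evaluation code.

```python
import numpy as np


def iou(pred, target):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a
        # convention chosen for this sketch, not something from the paper.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


def mean_iou(preds, targets):
    """Average the per-image IoU over a collection of mask pairs."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(iou(pred, target))                              # one shared pixel out of two
    print(mean_iou([pred, target], [target, target]))     # averages the two pairs
```

Under this definition, a 3.74% mIoU gain means the average overlap ratio rose by 3.74 percentage points.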

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper primarily aims to address the following key issues:

1. **Enhancing Medical Image Segmentation with Text Information**: Recent developments have shown the potential of text information in deepening the understanding of medical visual semantics. However, effectively integrating text information into medical image segmentation remains a challenge.
2. **Addressing Issues with Existing Methods**: Existing methods use implicit and ambiguous architectures to embed text information, leading to segmentation results that are semantically inconsistent with the text descriptions, sometimes even showing significant deviations.
3. **Proposing a New Cross-Modal Conditioned Reconstruction Method**: To overcome the above challenges, the paper proposes a novel cross-modal conditioned reconstruction method named "RecLMIS" for language-guided medical image segmentation. This method assumes that well-aligned medical visual features and medical notes can effectively reconstruct each other.
4. **Improving Model Performance While Reducing Parameter Count and Computational Load**: In the reported experiments, RecLMIS not only surpasses baseline methods (such as LViT) on the publicly available MosMedData+ dataset but also achieves a 20.2% reduction in the number of parameters and a 55.5% reduction in computational load.

In summary, the core contribution of this paper is a new method for language-guided medical image segmentation that captures the interaction between text and images more effectively while reducing computational cost and maintaining high performance.
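
The assumption that well-aligned visual features and medical notes can reconstruct each other can be illustrated with a toy example. The sketch below is not the RecLMIS implementation: it assumes plain scaled dot-product attention as the reconstruction operator and a mean-squared-error loss, and it omits the conditioned selection of patches and words of interest described in the abstract. The names `cross_reconstruct` and `mutual_reconstruction_loss` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_reconstruct(queries, context):
    """Rebuild each query vector as an attention-weighted mix of context vectors.

    queries: (n_q, d) array, context: (n_c, d) array; returns an (n_q, d) array.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context


def mutual_reconstruction_loss(patches, words):
    """Sum of the two reconstruction errors (assumed MSE, for illustration).

    Patches are rebuilt from words and words from patches; a small loss means
    each modality carries enough information to recover the other.
    """
    rec_patches = cross_reconstruct(patches, words)
    rec_words = cross_reconstruct(words, patches)
    loss_patches = np.mean((rec_patches - patches) ** 2)
    loss_words = np.mean((rec_words - words) ** 2)
    return float(loss_patches + loss_words)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(16, 8))   # 16 image patch features, dimension 8
    words = rng.normal(size=(5, 8))      # 5 word features, dimension 8
    print(mutual_reconstruction_loss(patches, words))
```

In a trained model the features would come from learned image and text encoders, and a loss of this kind would be minimized alongside the segmentation objective; both of those details are assumptions here rather than facts taken from the paper.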

MulModSeg: Enhancing Unpaired Multi-Modal Medical Image Segmentation with Modality-Conditioned Text Embedding and Alternating Training

Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration

Complementary Information Mutual Learning for Multimodality Medical Image Segmentation

LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation

LViT: Language meets Vision Transformer in Medical Image Segmentation

Language-guided Medical Image Segmentation with Target-informed Multi-level Contrastive Alignments

Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models

MASS: Modality-collaborative semi-supervised segmentation by exploiting cross-modal consistency from unpaired CT and MRI images

CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation

TG-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model

Cross-View Mutual Learning for Semi-Supervised Medical Image Segmentation

LIMIS: Towards Language-based Interactive Medical Image Segmentation

Advancing MRI Segmentation with CLIP-driven Semi-Supervised Learning and Semantic Alignment

Enhancing Cross-Modal Medical Image Segmentation through Compositionality

ASIMSA: Advanced Semantic Information Guided Multi-Scale Alignment Framework for Medical Vision-Language Pretraining

Cross-Modal Causal Intervention for Medical Report Generation

Linear semantic transformation for semi-supervised medical image segmentation

Towards Cross-modality Medical Image Segmentation with Online Mutual Knowledge Distillation

Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

Semi-MedSeq: Semi-supervised Semantic Segmentation for Medical Image Sequences