Enhancing Label-efficient Medical Image Segmentation with Text-guided Diffusion Models

Chun-Mei Feng
2024-07-07
Abstract: Aside from offering state-of-the-art performance in medical image generation, denoising diffusion probabilistic models (DPM) can also serve as a representation learner to capture semantic information and potentially be used as an image representation for downstream tasks, e.g., segmentation. However, these latent semantic representations rely heavily on labor-intensive pixel-level annotations as supervision, limiting the usability of DPM in medical image segmentation. To address this limitation, we propose an enhanced diffusion segmentation model, called TextDiff, that improves semantic representation through inexpensive medical text annotations, thereby explicitly establishing semantic representation and language correspondence for diffusion models. Concretely, TextDiff extracts intermediate activations of the Markov step of the reverse diffusion process in a pretrained diffusion model on large-scale natural images and learns additional expert knowledge by combining them with complementary and readily available diagnostic text information. TextDiff freezes the dual-branch multi-modal structure and mines the latent alignment of semantic features in diffusion models with diagnostic descriptions by only training the cross-attention mechanism and pixel classifier, making it possible to enhance semantic representation with inexpensive text. Extensive experiments on public QaTa-COVID19 and MoNuSeg datasets show that our TextDiff is significantly superior to the state-of-the-art multi-modal segmentation methods with only a few training samples.
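As a rough illustration of the text-to-image cross-attention mentioned in the abstract, the sketch below lets each pixel-level feature attend over a set of text-token features. It is a minimal sketch under stated assumptions, not TextDiff's actual implementation: the names `softmax` and `cross_attention`, the toy feature values, and the residual addition are hypothetical, and the learned query/key/value projections are omitted for brevity. In the setting the abstract describes, the inputs would come from the frozen diffusion model and the frozen text branch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(pixel_feats, text_feats):
    """Fuse text information into per-pixel features.

    pixel_feats: list of N vectors (queries), e.g. diffusion activations.
    text_feats:  list of M vectors (keys and values), e.g. text-token embeddings.
    Both share the same dimension. Learned projections are omitted (assumption).
    """
    dim = len(text_feats[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in pixel_feats:
        # Scaled dot-product similarity between this pixel and every text token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in text_feats]
        weights = softmax(scores)
        # Attention output: weighted average of the text-token vectors.
        context = [sum(w * tok[j] for w, tok in zip(weights, text_feats)) for j in range(dim)]
        # Residual connection keeps the original visual feature (assumption).
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Toy inputs: 3 "pixels" and 2 "text tokens", each with 2-dimensional features.
pixels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
for row in cross_attention(pixels, tokens):
    print([round(v, 3) for v in row])
```

The output has one fused vector per pixel, which a per-pixel classifier could then map to a segmentation label. In a setup like the one the abstract describes, only the attention parameters and that classifier would receive gradient updates.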
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to reduce the dependence on pixel-level annotations in medical image segmentation, so that a model still performs well when only a small number of training samples is available. Specifically, the author proposes a new method, TextDiff, which enhances the semantic representation ability of denoising diffusion probabilistic models (DPM) by introducing inexpensive medical text annotations, thereby improving segmentation quality.

### Problem Background

1. **Dependence on Pixel-level Annotations**: Traditional deep-learning methods for medical image segmentation usually require a large amount of pixel-level annotated data as the supervision signal. However, obtaining high-quality medical images and their pixel-level annotations is time-consuming and expensive.
2. **Limitations of Existing Methods**: Although some semi-supervised and weakly-supervised learning methods can reduce the dependence on annotated data, their effectiveness is often limited by the quality of the pseudo-labels. Low-quality pseudo-labels significantly degrade segmentation accuracy.

### Solution

To overcome these problems, the author proposes TextDiff. Its main contributions are:

1. **Introducing Medical Text Annotations**: By incorporating medical text annotations (such as diagnostic reports), TextDiff uses inexpensive and easily accessible information to enhance the semantic representation of images, thereby reducing the dependence on pixel-level annotations.
2. **Cross-Modal Attention Mechanism**: TextDiff aligns text features with intermediate activations in the diffusion model through a cross-modal attention mechanism, further enhancing the visual-semantic representation.
3. **Freezing the Dual-Branch Structure**: TextDiff trains only the cross-attention mechanism and the pixel classifier, while freezing the weights of the text encoder and the image encoder, thereby maintaining visual-language alignment and improving the model's generalization ability.

### Experimental Results

The author conducted experiments on public datasets, including QaTa-COVID19 and MoNuSeg. The results show that TextDiff significantly outperforms existing multi-modal segmentation methods when only a small number of training samples is used. For example, on the MoNuSeg dataset, the Dice coefficient of TextDiff increased from 66.38% to 78.67%, and the IoU increased from 49.83% to 64.98%.

### Summary

By introducing medical text annotations, TextDiff enhances the semantic representation ability of the diffusion model, reduces the dependence on pixel-level annotations, and thus achieves better performance in medical image segmentation tasks.
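
For readers unfamiliar with the two metrics quoted in the experimental results, the following sketch computes the Dice coefficient and IoU for a pair of binary masks. It uses the standard definitions, Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B|; the function name `dice_and_iou`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, and the paper's own evaluation code may aggregate scores differently.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    pred_area = sum(1 for p in flat_pred if p)
    target_area = sum(1 for t in flat_target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks are empty: treat as perfect agreement (a convention, assumed here).
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 2x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 in common.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.4f}, IoU={iou:.4f}")  # Dice=0.6667, IoU=0.5000
```

For a single pair of masks the two scores are tied by IoU = Dice / (2 - Dice), so Dice is never lower than IoU. Scores averaged over many images need not satisfy this identity exactly.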