Abstract: Detecting text in medical images is a challenging computer vision problem because of complex backgrounds, dense text, and text instances with extreme aspect ratios. This paper introduces an effective and precise text detection system designed to address these challenges. The system combines an optimized segmentation module, a trainable post-processing method, and a vision-language pre-training model (oCLIP). Specifically, our segmentation head integrates three components: a Feature Pyramid Network (FPN) module that combines a residual structure with a channel attention mechanism; the Efficient Feature Enhancement Module (EFEM); and the Multi-Scale Feature Fusion with RSEConv (MSFM-RSE) module, which performs multi-scale feature fusion based on RSEConv. In the FPN module, the convolutional layers are replaced with RSEConv layers, which add a residual structure and channel attention to strengthen the representational capacity of the feature maps. The EFEM, designed as a cascaded U-shaped module, incorporates a spatial attention mechanism to introduce multi-level information, thereby improving segmentation performance. The MSFM-RSE then fuses features from different depths and scales of the EFEM to produce the final features used for segmentation. In addition, the post-processing module employs a differentiable binarization strategy, which allows the segmentation network to determine the binarization threshold dynamically. Building on these improvements, we introduce a vision-language pre-training model trained extensively on a variety of visual-language understanding tasks. This pre-trained model learns detailed visual and semantic representations that, when integrated with the segmentation module, further improve the accuracy and robustness of text detection. We evaluated the proposed model on medical text image datasets, and multiple benchmark experiments show that it outperforms existing methods. Code is available at: https://github.com/csworkcode/VLDBNet
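
The differentiable binarization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the formulation commonly used in DBNet-style detectors, B = 1 / (1 + exp(-k (P - T))), where P is the predicted probability map and T is the learned threshold map. The function name, the toy values, and the amplification factor k = 50 are assumptions for illustration and are not taken from this paper or its repository.

```python
import math


def differentiable_binarization(prob, thresh, k=50.0):
    """Soft step function: 1 / (1 + exp(-k * (P - T))) applied per pixel.

    prob and thresh are 2D lists of the same shape (hypothetical toy maps).
    k is an assumed amplification factor; larger k approaches a hard step.
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob, thresh)
    ]


# Toy 2x2 probability map and per-pixel threshold map (made-up values).
prob = [[0.9, 0.2], [0.55, 0.45]]
thresh = [[0.3, 0.3], [0.5, 0.5]]

approx = differentiable_binarization(prob, thresh)

# Hard binarization with the same per-pixel thresholds, for comparison.
hard = [
    [1 if p >= t else 0 for p, t in zip(p_row, t_row)]
    for p_row, t_row in zip(prob, thresh)
]

for row in approx:
    print([round(v, 3) for v in row])
```

Because the soft step is smooth in both P and T, gradients can flow into the threshold map during training, which is what lets the network learn the threshold rather than rely on a fixed one.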

LViT: Language meets Vision Transformer in Medical Image Segmentation

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation

MMViT-Seg: A Lightweight Transformer and CNN Fusion Network for COVID-19 Segmentation

Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation

Bi-VLGM: Bi-Level Class-Severity-Aware Vision-Language Graph Matching for Text Guided Medical Image Segmentation

Many Birds, One Stone: Medical Image Segmentation with Multiple Partially Labeled Datasets

Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models

When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class Medical Image Semantic Segmentation

TG-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Cross-Modal Conditioned Reconstruction for Language-guided Medical Image Segmentation

Enhancing medical text detection with vision-language pre-training and efficient segmentation

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

LATrans-Unet: Improving CNN-Transformer with Location Adaptive for Medical Image Segmentation

UT-MT: A Semi-Supervised Model of Fusion Transformer for 3D Medical Image Segmentation

ViT-UperNet: a hybrid vision transformer with unified-perceptual-parsing network for medical image segmentation

PLMVQA: Applying Pseudo Labels for Medical Visual Question Answering with Limited Data

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation

LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation