Abstract: Detecting text within medical images presents a formidable challenge in the domain of computer vision due to the intricate nature of textual backgrounds, the dense text concentration, and the possible existence of extreme aspect ratios. This paper introduces an effective and precise text detection system tailored to address these challenges. The system incorporates an optimized segmentation module, a trainable post-processing method, and leverages a vision-language pre-training model (oCLIP). Specifically, our segmentation head integrates three essential components: the Feature Pyramid Network (FPN) module, which combines a residual structure and channel attention mechanism; the Efficient Feature Enhancement Module (EFEM); and the Multi-Scale Feature Fusion with RSEConv (MSFM-RSE) module, which performs multi-scale feature fusion based on RSEConv. Within the FPN module, the convolutional layers are replaced with RSEConv layers, which add a residual structure and a channel attention mechanism, further augmenting the representational capacity of the feature maps. The EFEM, designed as a cascaded U-shaped module, incorporates a spatial attention mechanism to introduce multi-level information, thereby enhancing segmentation performance. Subsequently, the MSFM-RSE combines features from various depths and scales of the EFEM to generate comprehensive final features tailored for segmentation purposes. Additionally, a post-processing module employs a differentiable binarization strategy, allowing the segmentation network to dynamically determine the binarization threshold. Building on the system’s improvement, we introduce a vision-language pre-training model that undergoes extensive training on various visual language understanding tasks. This pre-trained model acquires detailed visual and semantic representations, further reinforcing both the accuracy and robustness in text detection when integrated with the segmentation module. The performance of our proposed model was evaluated through experiments on medical text image datasets, demonstrating excellent results. Multiple benchmark experiments validate its superior performance in comparison to existing methods. Code is available at: https://github.com/csworkcode/VLDBNet.
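The differentiable binarization step mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the formula, so this assumes the commonly used formulation B = 1 / (1 + exp(-k (P - T))), where P is the predicted probability map, T is the learned threshold map, and k is an amplification factor (k = 50 here is an assumption, not a value stated in this text). The function name and the nested-list representation are hypothetical and chosen only for illustration.

```python
import math


def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate a hard threshold with a steep sigmoid.

    prob_map and thresh_map are 2-D lists of floats in [0, 1] with the
    same shape. k controls how sharply the output switches from 0 to 1
    (assumed value; not taken from the abstract).
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


# Pixels well above their threshold approach 1, pixels below approach 0.
approx_binary = differentiable_binarization(
    [[0.9, 0.2, 0.3]],
    [[0.3, 0.3, 0.3]],
)
```

Because the sigmoid is smooth, gradients can flow through both the probability map and the threshold map during training, which is what lets the network learn a per-pixel threshold rather than relying on a fixed one.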

What problem does this paper attempt to address?

This paper attempts to solve the problem of detecting text in medical images. Specifically, due to the complex text background, high text density, and possible extreme aspect ratios in medical images, detecting text in these images becomes very challenging. The paper proposes an effective and accurate text-detection system, aiming to overcome these challenges. This system combines an optimized segmentation module, a trainable post-processing method, and utilizes the Vision-Language Pretraining Model (oCLIP). By introducing the residual structure and channel-attention mechanism into the Feature Pyramid Network (FPN) module, as well as designing the Multi-Scale Feature Fusion Module (MSFM-RSE) and the Efficient Feature Enhancement Module (EFEM), the system can better handle multi-scale feature fusion and improve the segmentation performance. In addition, by introducing the Vision-Language Pretraining Model, the system is further enhanced in terms of the accuracy and robustness of text detection.

### Main contributions:

1. **Efficient Feature Enhancement Module**: The Efficient Feature Enhancement Module (EFEM) and the Multi-Scale Feature Fusion Module (MSFM-RSE) are proposed. Based on the residual structure and the spatial-channel attention mechanism, they significantly enhance the feature representation ability of the network.
2. **Vision-Language Pretraining Model**: The model pre-trained with large-scale visual-language understanding tasks is utilized to improve the representation ability of the system, thereby enhancing the accuracy and robustness of text detection.
3. **Excellent Experimental Results**: Experiments are carried out on five publicly available scene-text detection datasets, demonstrating the competitiveness of this method in terms of efficiency, accuracy, F-score, and robustness. In particular, its performance on medical-image-text datasets is better than that of existing methods.
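To make the idea of a residual structure combined with channel attention more concrete, here is a minimal sketch. The exact design of RSEConv is not described in this text, so this is only a generic squeeze-and-excitation style reweighting with a skip connection, with the convolution itself omitted. The function names, the weight matrices `w1` and `w2`, and the nested-list feature layout (channels × height × width) are all hypothetical.

```python
import math


def channel_attention_weights(x, w1, w2):
    """Return one scale in (0, 1) per channel.

    x  : list of C channels, each an H x W list of floats.
    w1 : (C // r) x C weight matrix for the reduction layer (assumed).
    w2 : C x (C // r) weight matrix for the expansion layer (assumed).
    """
    # Squeeze: global average pooling over each channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: reduction layer with ReLU, then expansion layer with sigmoid.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    return [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]


def residual_channel_attention(x, w1, w2):
    """Reweight each channel by its attention scale and add the input back."""
    scales = channel_attention_weights(x, w1, w2)
    return [
        [[v + s * v for v in row] for row in ch]
        for ch, s in zip(x, scales)
    ]


# Two 2x2 channels; all-zero weights give a scale of 0.5 for every channel.
features = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
enhanced = residual_channel_attention(features, [[0.0, 0.0]], [[0.0], [0.0]])
```

The skip connection keeps the original features intact while the attention scales emphasize informative channels. In a trained network the weights would be learned and the block would wrap an actual convolution.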
### Specific problems solved:

- **Text Detection in Complex Backgrounds**: The text background in medical images is complex, and traditional text-detection methods struggle to identify text accurately against it.
- **High-Density Text Distribution**: Medical images often contain densely packed text, and existing algorithms struggle to accurately locate and segment all text instances.
- **Extreme Aspect Ratios**: The text in medical images may have extreme aspect ratios, which brings additional difficulties to detection.
- **Multilingual and Symbols**: Medical images may contain different languages and symbols, increasing the difficulty of detection.

With the methods above, the system proposed in the paper handles these problems well, providing strong support for medical-image analysis, document processing, intelligent information retrieval, and other fields.

Enhancing medical text detection with vision-language pre-training and efficient segmentation

LViT: Language meets Vision Transformer in Medical Image Segmentation

Medical Vision-Language Pre-Training for Brain Abnormalities

Multiscale Progressive Text Prompt Network for Medical Image Segmentation

SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues

TG-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model

Bi-VLGM: Bi-Level Class-Severity-Aware Vision-Language Graph Matching for Text Guided Medical Image Segmentation

Enhancing Medical Image Segmentation with a Lightweight Boundary-Aware Multitask Detection Head

TP-DRSeg: Improving Diabetic Retinopathy Lesion Segmentation with Explicit Text-Prompts Assisted SAM

MedCLIP-SAMv2: Towards Universal Text-Driven Medical Image Segmentation

Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation

MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation

Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models

MSDEnet: Multi-scale detail enhanced network based on human visual system for medical image segmentation

Unified Medical Image Pre-training in Language-Guided Common Semantic Space

Improving Medical Vision-Language Contrastive Pretraining with Semantics-aware Triage

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology

Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

TSE DeepLab: An efficient visual transformer for medical image segmentation