Enhancing medical text detection with vision-language pre-training and efficient segmentation

Tianyang Li,Jinxu Bai,Qingzhu Wang
DOI: https://doi.org/10.1007/s40747-024-01378-3
IF: 6.7
2024-02-29
Complex & Intelligent Systems
Abstract: Detecting text within medical images presents a formidable challenge in the domain of computer vision due to the intricate nature of textual backgrounds, the dense text concentration, and the possible existence of extreme aspect ratios. This paper introduces an effective and precise text detection system tailored to address these challenges. The system combines an optimized segmentation module, a trainable post-processing method, and a vision-language pre-training model (oCLIP). Specifically, our segmentation head integrates three essential components: the Feature Pyramid Network (FPN) module, which combines a residual structure and channel attention mechanism; the Efficient Feature Enhancement Module (EFEM); and the Multi-Scale Feature Fusion with RSEConv (MSFM-RSE) module, which fuses multi-scale features using RSEConv. In the FPN module, the residual structure and channel attention mechanism are introduced by replacing the convolutional layers with RSEConv layers that employ channel attention, further augmenting the representational capacity of the feature maps. The EFEM, designed as a cascaded U-shaped module, incorporates a spatial attention mechanism to introduce multi-level information, thereby enhancing segmentation performance. The MSFM-RSE then combines features from various depths and scales of the EFEM to generate the final features used for segmentation. Additionally, a post-processing module employs a differentiable binarization strategy, allowing the segmentation network to dynamically determine the binarization threshold. Building on these improvements, we introduce a vision-language pre-training model that undergoes extensive training on various visual language understanding tasks. This pre-trained model acquires detailed visual and semantic representations, further reinforcing both accuracy and robustness in text detection when integrated with the segmentation module. The performance of our proposed model was evaluated through experiments on medical text image datasets, demonstrating excellent results. Multiple benchmark experiments validate its superior performance in comparison to existing methods. Code is available at: https://github.com/csworkcode/VLDBNet
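The differentiable binarization step mentioned in the abstract can be illustrated with a small sketch. In DBNet-style detectors, hard thresholding is commonly approximated by a steep sigmoid of the difference between a probability map P and a learned per-pixel threshold map T, so that gradients can flow to both. The code below is a minimal NumPy illustration under that assumption, not the authors' implementation; the function name, the amplification factor `k = 50`, and the toy map values are assumptions that do not come from this abstract.

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Soft binarization: B = 1 / (1 + exp(-k * (P - T))).

    prob_map   -- per-pixel text probability P (assumed to lie in [0, 1])
    thresh_map -- per-pixel threshold T, predicted by the network
    k          -- amplification factor (assumed value; larger is steeper)
    """
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# Toy 2x2 maps with made-up values, for illustration only.
P = np.array([[0.9, 0.2],
              [0.5, 0.7]])
T = np.array([[0.3, 0.3],
              [0.5, 0.9]])

B = differentiable_binarization(P, T)
# Pixels with P well above T approach 1, pixels below T approach 0,
# and a pixel with P equal to T maps to exactly 0.5.
```

Because the function is smooth, the threshold map can be trained jointly with the probability map instead of being fixed by hand, which is the property the abstract describes as letting the network determine the binarization threshold dynamically.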
computer science, artificial intelligence
What problem does this paper attempt to address?
This paper attempts to solve the problem of detecting text in medical images. Specifically, due to the complex text background, high text density, and possible extreme aspect ratios in medical images, detecting text in these images becomes very challenging. The paper proposes an effective and accurate text-detection system, aiming to overcome these challenges. This system combines an optimized segmentation module, a trainable post-processing method, and utilizes the Vision-Language Pretraining Model (oCLIP). By introducing the residual structure and channel-attention mechanism into the Feature Pyramid Network (FPN) module, as well as designing the Multi-Scale Feature Fusion Module (MSFM-RSE) and the Efficient Feature Enhancement Module (EFEM), the system can better handle multi-scale feature fusion and improve the segmentation performance. In addition, by introducing the Vision-Language Pretraining Model, the system is further enhanced in terms of the accuracy and robustness of text detection.

### Main contributions:

1. **Efficient Feature Enhancement Module**: The Efficient Feature Enhancement Module (EFEM) and the Multi-Scale Feature Fusion Module (MSFM-RSE) are proposed. Based on the residual structure and the spatial-channel attention mechanism, they significantly enhance the feature representation ability of the network.
2. **Vision-Language Pretraining Model**: The model pre-trained with large-scale visual-language understanding tasks is utilized to improve the representation ability of the system, thereby enhancing the accuracy and robustness of text detection.
3. **Excellent Experimental Results**: Experiments are carried out on five publicly available scene-text detection datasets, demonstrating the competitiveness of this method in terms of efficiency, accuracy, F-score, and robustness. In particular, its performance on medical-image-text datasets is better than that of existing methods.
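The combination of a residual structure with channel attention described above can be sketched in a few lines. The exact design of RSEConv is not given in this summary, so the code below is only one plausible reading: a squeeze-and-excitation style gate computes one weight per channel, and the reweighted features are added back to the input through a skip connection. The function names, the reduction ratio `r`, the tensor shapes, and the random weights are all hypothetical, and the convolution itself is omitted for brevity.

```python
import numpy as np

def se_channel_attention(x, w1, w2):
    """Squeeze-and-excitation style gate for a feature map x of shape (C, H, W).

    Returns one weight in (0, 1) per channel, shape (C,).
    """
    squeezed = x.mean(axis=(1, 2))               # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)      # bottleneck + ReLU -> (C // r,)
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gate -> (C,)

def residual_se_block(x, w1, w2):
    """Reweight channels by the gate, then add the input back (skip connection)."""
    gate = se_channel_attention(x, w1, w2)
    return x + x * gate[:, None, None]

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1   # squeeze: C -> C // r
w2 = rng.standard_normal((C, C // r)) * 0.1   # excite:  C // r -> C

y = residual_se_block(x, w1, w2)              # same shape as x
```

The skip connection keeps the original features intact even when a gate is close to zero, while the gate lets the network emphasize channels that respond to text.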
### Specific problems solved:

- **Text Detection in Complex Backgrounds**: Text backgrounds in medical images are complex, and traditional text-detection methods struggle to identify the text accurately.
- **High-Density Text Distribution**: Medical images often contain densely packed text, and existing algorithms struggle to locate and segment every text instance accurately.
- **Extreme Aspect Ratios**: Text in medical images may have extreme aspect ratios, which makes detection harder.
- **Multiple Languages and Symbols**: Medical images may contain different languages and symbols, increasing the difficulty of detection.

With the methods above, the system proposed in the paper handles these problems well, supporting medical-image analysis, document processing, intelligent information retrieval, and other fields.