Abstract: Detecting text in medical images is a challenging computer vision problem because of complex backgrounds, dense text, and text instances with extreme aspect ratios. This paper introduces an efficient and accurate text detection system designed to address these challenges. The system combines an optimized segmentation module, a trainable post-processing method, and a vision-language pre-training model (oCLIP). Specifically, our segmentation head integrates three components: a Feature Pyramid Network (FPN) module that combines a residual structure with a channel attention mechanism; the Efficient Feature Enhancement Module (EFEM); and the Multi-Scale Feature Fusion with RSEConv (MSFM-RSE), which performs multi-scale feature fusion based on RSEConv. In the FPN module, the convolutional layers are replaced with RSEConv layers, which add a residual structure and channel attention and thereby increase the representational capacity of the feature maps. The EFEM, designed as a cascaded U-shaped module, incorporates a spatial attention mechanism to introduce multi-level information and improve segmentation performance. The MSFM-RSE then fuses features from different depths and scales of the EFEM to produce the final features used for segmentation. Additionally, the post-processing module employs a differentiable binarization strategy, which allows the segmentation network to determine the binarization threshold dynamically. Beyond these architectural improvements, we introduce a vision-language pre-training model that is trained extensively on a range of vision-language understanding tasks. This pre-trained model acquires detailed visual and semantic representations, which further improve the accuracy and robustness of text detection when it is integrated with the segmentation module. We evaluated the proposed model on medical text image datasets, where it achieved strong results, and experiments on multiple benchmarks show that it outperforms existing methods. Code is available at: https://github.com/csworkcode/VLDBNet.
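The differentiable binarization step mentioned above can be illustrated with a minimal sketch. The abstract does not state the formula, so this assumes the commonly used formulation in which the approximate binary map is a steep sigmoid of the difference between a predicted probability map P and a predicted threshold map T, with an amplification factor k (50 is a typical choice). The function names, the nested-list map representation, and the fixed threshold of 0.3 in the comparison function are illustrative assumptions, not taken from the released code.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Element-wise approximate binarization: 1 / (1 + exp(-k * (P - T))).

    prob_map and thresh_map are 2D lists of equal shape with values in [0, 1].
    Unlike a hard step function, this is differentiable in both P and T,
    so the threshold map can be learned together with the probability map.
    """
    return [
        [_sigmoid(k * (p - t)) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


def hard_binarization(prob_map, t=0.3):
    """Fixed-threshold binarization, shown only for comparison (assumed t)."""
    return [[1 if p >= t else 0 for p in row] for row in prob_map]


if __name__ == "__main__":
    # Toy 2x2 maps (made-up values for illustration only).
    P = [[0.9, 0.2], [0.5, 0.4]]
    T = [[0.3, 0.3], [0.5, 0.6]]
    for row in differentiable_binarization(P, T):
        print([round(v, 4) for v in row])
```

With a large k, the output saturates toward 0 or 1 wherever P and T differ noticeably, while pixels where P equals T map to exactly 0.5. This is what lets the per-pixel threshold be predicted by the network rather than fixed by hand.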

Enhancing medical text detection with vision-language pre-training and efficient segmentation
