Abstract: Natural scene text detection is a significant challenge in computer vision, with broad potential applications in multilingual, diverse, and complex text scenarios. We propose a multilingual text detection model to address the low accuracy and high difficulty of detecting multilingual text in natural scenes. Because multilingual text images contain multiple character sets and varied font styles, we introduce the SFM Swin Transformer feature extraction network to make the model more robust when detecting characters and fonts across languages. To handle the considerable variation in text scale and the complex arrangements found in natural scene text images, we present the AS-HRFPN feature fusion network, which incorporates an Adaptive Spatial Feature Fusion module and a Spatial Pyramid Pooling module. These improvements to the feature fusion network enhance the model's ability to detect text of different sizes and orientations. Existing methods struggle with the diverse backgrounds and font variations in multilingual scene text images because their limited local receptive fields hinder detection performance. To overcome this, we propose a Global Semantic Segmentation Branch that extracts and preserves global features for more effective text detection. We collected and built a real-world multilingual natural scene text image dataset and conducted comprehensive experiments and analyses. The experimental results show that the proposed algorithm achieves an F-measure of 85.02%, which is 4.71% higher than the baseline model. We also conducted extensive cross-dataset validation on the MSRA-TD500, ICDAR2017MLT, and ICDAR2015 datasets to verify the generality of our approach. The code and dataset are available at https://github.com/wangmelon/CEMLT.
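The F-measure reported above is conventionally the harmonic mean of precision and recall. The following is a minimal sketch of that standard formula, not the authors' evaluation code; the function name `f_measure` and the detection counts are hypothetical and chosen only for illustration, and the matching rule that decides which detections count as true positives (for example, an IoU threshold) is assumed to have been applied already.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F-measure) from detection counts.

    Assumes detections were already matched against ground truth,
    so the three counts are given directly.
    """
    detected = true_positives + false_positives
    ground_truth = true_positives + false_negatives
    # Guard against empty detection or ground-truth sets.
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / ground_truth if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Hypothetical counts for illustration only; not taken from the paper.
p, r, f = f_measure(true_positives=80, false_positives=20, false_negatives=20)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot reach a high F-measure by trading away recall for precision or the reverse.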

ESRNet: an exploring sample relationships network for arbitrary-shaped scene text detection

R-Net: A Relationship Network for Efficient and Accurate Scene Text Detection

DSRN: A Deep Scale Relationship Network for Scene Text Detection

A holistic representation guided attention network for scene text recognition

ReLaText: Exploiting Visual Relationships for Arbitrary-Shaped Scene Text Detection with Graph Convolutional Networks

CRNet: A Center-aware Representation for Detecting Text of Arbitrary Shapes

Scene Text Detection Using HRNet and Spatial Attention Mechanism

LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network

ContourNet: Taking a Further Step Toward Accurate Arbitrary-shaped Scene Text Detection

Boundary-aware Arbitrary-shaped Scene Text Detector with Learnable Embedding Network

A irregular text detection via dilated recombination and efficient reorganization on natural scene

CT-Net: Arbitrary-Shaped Text Detection via Contour Transformer

DPNet: Scene text detection based on dual perspective CNN-transformer

EK-Net: Real-time Scene Text Detection with Expand Kernel Distance

Attentive Relational Networks for Mapping Images to Scene Graphs

Accurate Scene Text Detection Via Scale-Aware Data Augmentation and Shape Similarity Constraint

Aggregated Text Transformer for Scene Text Detection

CPN: Complementary Proposal Network for Unconstrained Text Detection

Research on Multilingual Natural Scene Text Detection Algorithm

Arbitrary Shape Scene Text Detection with Adaptive Text Region Representation

DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection