Abstract: Natural scene text detection is a significant challenge in computer vision, with broad potential applications in multilingual, diverse, and complex text scenarios. We propose a multilingual text detection model to address the low accuracy and high difficulty of detecting multilingual text in natural scenes. To handle multilingual text images with multiple character sets and varied font styles, we introduce the SFM Swin Transformer feature extraction network, which improves the model's robustness in detecting characters and fonts across different languages. To handle the considerable variation in text scale and the complex arrangements found in natural scene text images, we present the AS-HRFPN feature fusion network, which incorporates an Adaptive Spatial Feature Fusion module and a Spatial Pyramid Pooling module. These improvements to the feature fusion network enhance the model's ability to detect text of varying sizes and orientations. Existing methods struggle with the diverse backgrounds and font variations of multilingual scene text images; limited local receptive fields hinder their detection performance. To overcome this, we propose a Global Semantic Segmentation Branch that extracts and preserves global features for more effective text detection. We collected and built a real-world multilingual natural scene text image dataset and conducted comprehensive experiments and analyses on it. The experimental results show that the proposed algorithm achieves an F-measure of 85.02%, which is 4.71% higher than the baseline model. We also conducted extensive cross-dataset validation on the MSRA-TD500, ICDAR2017MLT, and ICDAR2015 datasets to verify the generality of our approach. The code and dataset are available at https://github.com/wangmelon/CEMLT.

E2E-MLT - An Unconstrained End-to-End Method for Multi-language Scene Text

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

EMU: Effective Multi-Hot Encoding Net for Lightweight Scene Text Recognition with a Large Character Set

Mlts: A Multi-Language Scene Text Spotter

A Multiplexed Network for End-to-End, Multilingual OCR

MASTER: Multi-Aspect Non-local Network for Scene Text Recognition

Text detection and script identification in natural scene images using deep learning

ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition -- RRC-MLT-2019

An end-to-end model for multi-view scene text recognition

MEAN: Multi-Element Attention Network for Scene Text Recognition

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

A Multitask Network for Localization and Recognition of Text in Images

Effective Multi-Hot Encoding and Classifier for Lightweight Scene Text Recognition with a Large Character Set

SEE: Towards Semi-Supervised End-to-End Scene Text Recognition

Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling

A New Language-Independent Deep CNN for Scene Text Detection and Style Transfer in Social Media Images

Complete Multilingual Neural Machine Translation

Research on Multilingual Natural Scene Text Detection Algorithm