Abstract: Imagery texts are usually organized as a hierarchy of several visual elements, i.e. characters, words, text lines and text blocks. Among these elements, the character is the most basic one for various languages such as Western, Chinese, Japanese, mathematical expressions, etc. It is natural and convenient to construct a common text detection engine based on character detectors. However, training character detectors requires a vast number of location-annotated characters, which are expensive to obtain. In practice, the existing real text datasets are mostly annotated at the word or line level. To remedy this dilemma, we propose a weakly supervised framework that can utilize word annotations, either in tight quadrangles or in looser bounding boxes, for character detector training. When applied to scene text detection, we are thus able to train a robust character detector by exploiting word annotations in the rich large-scale real scene text datasets, e.g. ICDAR15 and COCO-Text. The character detector plays a key role in the pipeline of our text detection engine. It achieves state-of-the-art performance on several challenging scene text detection benchmarks. We also demonstrate the flexibility of our pipeline in various scenarios, including deformed text detection and math expression recognition.

What problem does this paper attempt to address?

The paper addresses the following problem in scene text detection: how to train character detectors using the word annotations available in existing large-scale real-scene text datasets. Character-level annotations are expensive and difficult to obtain, and most existing real-scene text datasets (such as ICDAR15 and COCO-Text) are annotated at the word or line level. The authors therefore propose a weakly supervised learning framework that trains character detectors from these word-level annotations.

### Specific problem description

1. **Insufficient training data for character detectors**: Training character detectors requires a large amount of character position annotations, which are costly and time-consuming to obtain.
2. **Coarse annotation granularity in existing datasets**: Most existing large-scale real-scene text datasets are annotated at the word or line level rather than at the character level.
3. **Importance of character detection**: Characters are the basic components of various languages (such as Western languages, Chinese, Japanese, mathematical expressions, etc.). A character-based detector can therefore serve as a universal text detection engine that applies to multiple languages and scenarios.

### Solution

To solve the above problems, the authors propose a weakly supervised learning framework that uses word-level annotations to train character detectors. Its main components are:

- **Weakly supervised framework**: Iteratively update the character center mask and the character model to gradually improve the performance of the character detector.
- **Character mask generation**: Automatically generate character masks from the current character model and the word annotations.
- **Character network update**: Use the generated character masks as supervision signals to update the character detection network.
- **Multi-scale testing**: Adopt a multi-scale testing strategy to handle large variations in character size.

### Experimental verification

The authors conducted experiments on multiple benchmark datasets, including ICDAR13, ICDAR15, and COCO-Text. The results show significant performance improvements in both character detection and scene text detection.

### Summary

The core contribution of this paper is an effective weakly supervised learning framework. It exploits existing word-level annotations to train high-performance character detectors in the absence of character-level annotations, thereby improving the overall performance of scene text detection.
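
The alternation described under Solution (generate character masks from the current model and the word annotations, then update the model on those masks) can be illustrated with a deliberately tiny sketch. This is not the paper's network or training procedure. It is a 1-D toy in which the "character model" is a single assumed character width, word annotations are intervals on a line, and all names (`generate_char_centers`, `update_char_width`, `train_weakly_supervised`) are hypothetical.

```python
from typing import List, Sequence, Tuple

# Toy assumption: a word annotation is an interval (x_start, x_end) on a 1-D line.
WordBox = Tuple[float, float]


def generate_char_centers(word: WordBox, char_width: float) -> List[float]:
    """Step 1 (mask generation): place pseudo character centers inside a word
    box, using the current model (here just one character width)."""
    x0, x1 = word
    n_chars = max(1, round((x1 - x0) / char_width))
    step = (x1 - x0) / n_chars
    return [x0 + step * (i + 0.5) for i in range(n_chars)]


def update_char_width(words: Sequence[WordBox],
                      centers_per_word: Sequence[List[float]]) -> float:
    """Step 2 (model update): refit the model to the pseudo labels.
    In this toy, the new width is total word length / total pseudo characters."""
    total_length = sum(x1 - x0 for x0, x1 in words)
    total_chars = sum(len(centers) for centers in centers_per_word)
    return total_length / total_chars


def train_weakly_supervised(words: Sequence[WordBox],
                            init_width: float,
                            n_iters: int = 5) -> Tuple[float, List[List[float]]]:
    """Alternate between pseudo-label generation and model update."""
    width = init_width
    centers: List[List[float]] = []
    for _ in range(n_iters):
        centers = [generate_char_centers(word, width) for word in words]
        width = update_char_width(words, centers)
    return width, centers


if __name__ == "__main__":
    # Three toy words whose true character width is 10 (3, 6 and 4 characters).
    toy_words = [(0, 30), (40, 100), (110, 150)]
    width, centers = train_weakly_supervised(toy_words, init_width=9.0)
    print(width, [len(c) for c in centers])
```

With these toy inputs the width estimate settles at 10.0 after two rounds. A poor initial guess can settle on a different fixed point, so the quality of the starting model matters in this kind of alternating scheme.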

WordSup: Exploiting Word Annotations for Character based Text Detection

WeText: Scene Text Detection under Weak Supervision

A Scene Text Detector for Text with Arbitrary Shapes

MOST: A Multi-Oriented Scene Text Detector with Localization Refinement

Character Region Awareness for Text Detection

Scene Text Recognition Using Part-Based Tree-Structured Character Detection

BURSTS: A Bottom-Up Approach for Robust Spotting of Texts in Scenes.

TDI TextSpotter: Taking Data Imbalance into Account in Scene Text Spotting.

A Robust Method: Arbitrary Shape Text Detection Combining Semantic and Position Information

Text Detection in Scene Images Based on Exhaustive Segmentation

MorphText: Deep Morphology Regularized Arbitrary-shape Scene Text Detection

Robust Text Detection in Natural Scene Images

TextDCT: Arbitrary-Shaped Text Detection Via Discrete Cosine Transform Mask.

Accurate Scene Text Detection Via Scale-Aware Data Augmentation and Shape Similarity Constraint

Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing

Detecting Text in the Wild with Deep Character Embedding Network

Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes

EAST: An Efficient and Accurate Scene Text Detector

Multi-oriented Scene Text Detection via Corner Localization and Region Segmentation

CDText: Scene Text Detector Based on Context-Aware Deformable Transformer

Characterness: an Indicator of Text in the Wild.