WordSup: Exploiting Word Annotations for Character based Text Detection

Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, Errui Ding
DOI: https://doi.org/10.48550/arXiv.1708.06720
2017-08-23
Abstract: Imagery texts are usually organized as a hierarchy of several visual elements, i.e. characters, words, text lines and text blocks. Among these elements, character is the most basic one for various languages such as Western, Chinese, Japanese, mathematical expression and etc. It is natural and convenient to construct a common text detection engine based on character detectors. However, training character detectors requires a vast of location annotated characters, which are expensive to obtain. Actually, the existing real text datasets are mostly annotated in word or line level. To remedy this dilemma, we propose a weakly supervised framework that can utilize word annotations, either in tight quadrangles or the more loose bounding boxes, for character detector training. When applied in scene text detection, we are thus able to train a robust character detector by exploiting word annotations in the rich large-scale real scene text datasets, e.g. ICDAR15 and COCO-text. The character detector acts as a key role in the pipeline of our text detection engine. It achieves the state-of-the-art performance on several challenging scene text detection benchmarks. We also demonstrate the flexibility of our pipeline by various scenarios, including deformed text detection and math expression recognition.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to train character detectors for scene text detection using the word annotations in existing large-scale real-scene text datasets. Character-level annotation is expensive and difficult to obtain, and most existing real-scene text datasets (such as ICDAR15 and COCO-Text) are annotated at the word or line level. The authors therefore propose a weakly supervised learning framework that trains character detectors from these word-level annotations.

### Specific problem description

1. **Insufficient training data for character detectors**: Training character detectors requires a large amount of character position annotation data, which is costly and time-consuming to obtain.
2. **Coarse annotation granularity in existing datasets**: Most existing large-scale real-scene text datasets are annotated at the word or line level rather than at the character level.
3. **Importance of character detection**: Characters are the basic components of various languages (such as Western languages, Chinese, Japanese, and mathematical expressions). A character-based detector can serve as a universal text detection engine that applies to multiple languages and scenarios.

### Solution

To solve these problems, the authors propose a weakly supervised learning framework that uses word-level annotations to train character detectors. The specific methods are as follows:

- **Weakly supervised framework**: Iteratively update the character center mask and the character model to gradually improve the performance of the character detector.
- **Character mask generation**: Automatically generate character masks from the current character model and the word annotations.
- **Character network update**: Use the generated character masks as supervision signals to update the character detection network.
- **Multi-scale testing**: Adopt a multi-scale testing strategy to handle large variations in character size.

### Experimental verification

The authors conducted experiments on multiple benchmark datasets, including ICDAR13, ICDAR15, and COCO-Text. The results show that the method achieves significant performance improvements in both character detection and scene text detection.

### Summary

The core contribution of this paper is an effective weakly supervised learning framework. It makes full use of existing word-level annotations to train high-performance character detectors in the absence of character-level annotations, thereby improving the overall performance of scene text detection.
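The alternating procedure described under Solution (generate character pseudo-labels from the current model and the word annotations, then update the model on them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the selection rule (keep a candidate if its score passes a threshold and its center lies inside an annotated word box), the axis-aligned box format, and the names `Candidate`, `generate_pseudo_labels`, `train_weakly`, `predict`, and `update` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumed box format: axis-aligned (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class Candidate:
    """A character box proposed by the current model, with its confidence."""
    box: Box
    score: float


def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def contains(word: Box, point: Tuple[float, float]) -> bool:
    x0, y0, x1, y1 = word
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def generate_pseudo_labels(
    word_boxes: Sequence[Box],
    candidates: Sequence[Candidate],
    min_score: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[Box]:
    """Keep candidates that are confident and supported by a word annotation."""
    kept = []
    for cand in candidates:
        if cand.score < min_score:
            continue
        if any(contains(word, center(cand.box)) for word in word_boxes):
            kept.append(cand.box)
    return kept


def train_weakly(
    dataset: Sequence[Tuple[object, Sequence[Box]]],
    predict: Callable[[object], Sequence[Candidate]],
    update: Callable[[list], None],
    rounds: int = 3,
) -> None:
    """Alternate between pseudo-label generation and model update."""
    for _ in range(rounds):
        batch = []
        for image, word_boxes in dataset:
            # Step 1: pseudo-labels from the current model plus word boxes.
            labels = generate_pseudo_labels(word_boxes, predict(image))
            batch.append((image, labels))
        # Step 2: update the character model on the pseudo-labels.
        update(batch)
```

Here `predict` and `update` stand in for the character network's inference and training steps. Because the pseudo-labels are regenerated in every round, they can improve as the model does.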