Abstract: Scene text detection methods based on neural networks have emerged recently and have shown promising results. Previous methods trained with rigid word-level bounding boxes exhibit limitations in representing the text region in an arbitrary shape. In this paper, we propose a new scene text detection method to effectively detect text area by exploring each character and affinity between characters. To overcome the lack of individual character level annotations, our proposed framework exploits both the given character-level annotations for synthetic images and the estimated character-level ground-truths for real images acquired by the learned interim model. In order to estimate affinity between characters, the network is trained with the newly proposed representation for affinity. Extensive experiments on six benchmarks, including the TotalText and CTW-1500 datasets which contain highly curved texts in natural images, demonstrate that our character-level text detection significantly outperforms the state-of-the-art detectors. According to the results, our proposed method guarantees high flexibility in detecting complicated scene text images, such as arbitrarily-oriented, curved, or deformed texts.
What problem does this paper attempt to address?
This paper addresses the limitations of existing scene text detection methods on text with complex shapes, such as curved, deformed, or arbitrarily oriented text. Earlier neural-network-based detectors are trained mainly with rigid word-level bounding boxes, which limits how well they can represent text regions of arbitrary shape. To overcome this, the paper proposes CRAFT (Character Region Awareness For Text detection), a scene text detection method that localizes text regions by detecting individual characters and the affinity between them.
### Main Contributions
1. **Character-level Region Awareness**: CRAFT uses character-level annotations where they exist (synthetic images) and, for real images, estimates character-level ground truth with a learned interim model. Working at the character level captures fine detail and improves detection of text with complex shapes.
2. **Weakly-Supervised Learning**: Since most existing text datasets do not provide character-level annotations, the paper proposes a weakly-supervised learning framework that generates character-level pseudo-ground-truth from word-level annotations, compensating for the missing character-level labels.
3. **High Flexibility**: Experiments show that CRAFT significantly outperforms existing state-of-the-art detectors on multiple benchmark datasets, and is especially flexible on long, curved, or arbitrarily shaped text.
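To make the weakly-supervised idea in contribution 2 concrete, the sketch below scores how much a word's pseudo character boxes can be trusted. It assumes a length-based confidence of the form (l − min(l, |l − l_c|)) / l, where l is the annotated word length and l_c is the number of characters the interim model found; this reduces to the detected-to-real character ratio whenever the model finds no more characters than the word contains. The function names and the toy squared-error loss are illustrative, not taken from the paper's code.

```python
def word_confidence(true_len: int, detected_len: int) -> float:
    """Confidence in [0, 1] for one word's pseudo character boxes.

    Assumed form: (l - min(l, |l - l_c|)) / l, with l the annotated word
    length and l_c the number of characters the interim model detected.
    """
    if true_len <= 0:
        raise ValueError("true_len must be positive")
    error = min(true_len, abs(true_len - detected_len))
    return (true_len - error) / true_len


def weighted_pixel_loss(pred, target, conf):
    """Sum of squared errors, each pixel weighted by its word's confidence.

    All three arguments are flat sequences of equal length (hypothetical
    layout chosen to keep the example small).
    """
    return sum(c * (p - t) ** 2 for p, t, c in zip(pred, target, conf))
```

A word whose characters are all found gets confidence 1.0, so its pixels count fully in the loss; a word where the interim model misses or over-splits characters contributes proportionally less.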
### Method Overview
- **Network Architecture**: CRAFT is a fully convolutional network built on a VGG-16 backbone, with skip connections in the decoding part that aggregate low-level features. The final output has two score-map channels: the character-region score and the inter-character affinity score.
- **Training Strategy**:
  - **Ground-Truth Label Generation**: For each training image, ground-truth labels are generated for both the character-region score and the affinity score. The probability of a character center is encoded as a Gaussian heat map, which copes flexibly with ground-truth regions that are not rigidly bounded.
  - **Weakly-Supervised Learning**: For real images, character-level pseudo-ground-truth is generated from word-level annotations. The ratio of the number of detected characters to the true number of characters in each word measures how reliable the interim model's prediction is, and serves as a confidence weight for that word during training.
- **Inference Process**: At inference time, simple post-processing turns the score maps into text boxes of various shapes: word boxes, character boxes, or polygons. CRAFT needs no further post-processing such as non-maximum suppression (NMS), because word regions are already separated by connected-component labeling (CCL).
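The label-generation and inference steps above can be pictured with a small pure-Python toy. It is a sketch under stated assumptions, not the authors' implementation: the isotropic Gaussian, the map size, and the threshold values `region_thr` and `affinity_thr` are all chosen for illustration, boxes are axis-aligned, and the Gaussian maps stand in for network output.

```python
import math
from collections import deque


def gaussian_blob(height, width, cy, cx, sigma):
    """Score map with an isotropic Gaussian peak of 1.0 at (cy, cx)."""
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def merge_max(*maps):
    """Pixel-wise maximum of several equally sized score maps."""
    return [[max(vals) for vals in zip(*rows)] for rows in zip(*maps)]


def word_boxes(region, affinity, region_thr=0.4, affinity_thr=0.4):
    """Threshold both maps, label 4-connected components, return boxes.

    Each box is (top, left, bottom, right) in inclusive pixel coordinates.
    The threshold defaults are assumed values for this toy example.
    """
    h, w = len(region), len(region[0])
    mask = [[region[y][x] > region_thr or affinity[y][x] > affinity_thr
             for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill over one connected component.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            top, left, bottom, right = y0, x0, y0, x0
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy input: characters at x=5 and x=13 form one word, x=30 is another word.
H, W, SIGMA = 11, 40, 2.0
region = merge_max(*(gaussian_blob(H, W, 5, cx, SIGMA) for cx in (5, 13, 30)))
affinity = gaussian_blob(H, W, 5, 9, SIGMA)  # link between the first two characters
boxes = word_boxes(region, affinity)
```

With these toy maps, the first two characters end up in one box only because the affinity peak bridges the gap between them; with an all-zero affinity map each character becomes its own component. Since every component is already a separate word candidate, no overlap-suppression step is needed afterwards.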
### Experimental Results
- **Quadrilateral-type Datasets**: Results on the ICDAR and MSRA-TD500 datasets show that CRAFT achieves state-of-the-art performance on all of them, with a processing speed of 8.6 FPS on the IC13 dataset.
- **Polygon-type Datasets**: Results on the TotalText and CTW-1500 datasets further confirm CRAFT's strong performance on text of arbitrary shape, especially curved text. On CTW-1500 in particular, adding a small link refinement network (LinkRefiner) lets CRAFT handle long text especially well.
### Discussion
- **Robustness to Scale Changes**: CRAFT was evaluated at a single scale on all datasets, even though text sizes vary widely. Because CRAFT localizes individual characters rather than whole text instances, a relatively small receptive field is enough to cover a single character even in a large image, which makes it robust to scale variation.
- **Multilingual Issues**: The IC17 dataset contains Bengali and Arabic characters, which do not appear in the synthetic text datasets and are difficult to segment into individual characters. As a result, CRAFT performs worse on these scripts than on Latin, Korean, Chinese, and Japanese text.
In general, by introducing character-level region awareness and a weakly-supervised learning framework, CRAFT significantly improves the performance of scene text detection, especially on text with complex shapes.