Feature Pyramid Based Scene Text Detector

MengYi En, Rong Li, JianQiang Li, Bo Liu
DOI: https://doi.org/10.1109/icdar.2017.341
2017-01-01
Abstract: Features are critical for detecting text in natural scene images. Most current scene text detection algorithms leverage the feature learning power of convolutional neural networks (CNNs) to learn discriminative features that distinguish text from non-text, and perform detection based on these features. Features from the low layers of a CNN are high-resolution but carry less semantic information and have low discriminative power, which limits their representational capacity. Feature maps from high layers, on the other hand, are discriminative but coarse in resolution, which hurts the detection of small objects. In this paper, we present a feature pyramid based text detector (FPTD) for detecting scene text at different scales, especially small scales. Our framework is based on the state-of-the-art "Single Shot Detector" (SSD) framework. SSD performs detection on feature maps from the later stages of the network, which are coarse in resolution, so it does not achieve satisfactory results on small objects. In contrast, our framework incorporates a feature pyramid mechanism into SSD. Specifically, we adopt a top-down fusion strategy to build new features that have strong semantics while keeping fine details. Text detection is conducted on each of the newly constructed feature maps during a single forward pass. The detection results from all layers are gathered and passed through non-maximum suppression (NMS). Since detection is conducted on feature maps from several layers, which are at different scales but all discriminative, our framework is well suited to detecting text at different scales. Experimental results confirm that our framework achieves competitive performance on the ICDAR2013 text localization benchmark with marginal extra cost.
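The two mechanisms named in the abstract, top-down fusion and NMS over detections gathered from all layers, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the fusion operator, so nearest-neighbour 2x upsampling followed by element-wise addition is assumed here, feature maps are reduced to single-channel nested lists, and the names `upsample2x`, `fuse_top_down`, `iou`, and `nms`, as well as the 0.5 IoU threshold, are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]                         # single-channel feature map
Box = Tuple[float, float, float, float, float]   # x1, y1, x2, y2, score


def upsample2x(fm: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling (assumed; the paper may use another operator)."""
    out: Grid = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def fuse_top_down(maps: List[Grid]) -> List[Grid]:
    """Fuse maps ordered fine -> coarse, each half the size of the previous one.

    The coarsest map is kept as is; every finer map is summed with the
    upsampled fused map from the level above it (assumed fusion rule).
    """
    fused = [maps[-1]]
    for fine in reversed(maps[:-1]):
        up = upsample2x(fused[0])
        merged = [[a + b for a, b in zip(r_fine, r_up)]
                  for r_fine, r_up in zip(fine, up)]
        fused.insert(0, merged)
    return fused


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], thresh: float = 0.5) -> List[Box]:
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it above thresh."""
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= thresh for k in kept):
            kept.append(box)
    return kept


# Toy usage: fuse a 4x4 fine map with a 2x2 coarse map, then merge
# detections that two (hypothetical) pyramid levels produced.
fine = [[1.0] * 4 for _ in range(4)]
coarse = [[10.0, 20.0], [30.0, 40.0]]
fused = fuse_top_down([fine, coarse])

level_a = [(0, 0, 10, 10, 0.9)]
level_b = [(1, 1, 11, 11, 0.8), (50, 50, 60, 60, 0.7)]
final = nms(level_a + level_b)
```

In a real detector each fused map would feed its own prediction head, and the boxes from every head would be concatenated before the single NMS pass, as the abstract describes.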