An End-to-End Scene Text Detector with Dynamic Attention.

Jingyu Lin, Yan, Hanzi Wang
DOI: https://doi.org/10.1145/3551626.3564980
2022-01-01
Abstract: Detecting arbitrarily oriented text in natural images is a challenging multimedia task because text in natural scenes varies widely in curvature, orientation, and aspect ratio. Most previous scene text detectors fail to precisely locate text instances with unusual shapes, such as an extreme aspect ratio. In this paper, we propose a dynamic end-to-end framework (DEF) that includes a convolution-based dynamic encoder (CDE) with several attention types, which generates a deformable and dynamic view for multi-oriented and curved text instances. Unlike previous methods that apply time-consuming post-processing steps such as non-maximum suppression (NMS), our method uses a Transformer-based decoder (TD) with a bipartite matching loss to model the relationship between queries and their corresponding ground truths. With this architecture, the receptive field is not limited to a fixed shape, and the combination of global attention and local features provides a better representation of text in natural scenes. We conduct extensive qualitative and quantitative experiments on several popular datasets. The results show that the proposed method achieves superior performance compared with several state-of-the-art scene text detectors.
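To illustrate the bipartite matching idea mentioned in the abstract, here is a minimal Python sketch of one-to-one assignment between predicted queries and ground-truth boxes. It is not the authors' implementation: the abstract does not specify the matching cost, so the L1 box distance, the `(x1, y1, x2, y2)` box format, and the function names `l1_cost` and `bipartite_match` are all assumptions made for illustration. The search is brute force and only suitable for tiny inputs.

```python
from itertools import permutations


def l1_cost(pred_box, gt_box):
    """Sum of absolute coordinate differences (an assumed matching cost)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def bipartite_match(pred_boxes, gt_boxes):
    """Return (total_cost, pairs) for the cheapest one-to-one assignment.

    pairs is a sorted list of (query_index, gt_index). Queries left
    unmatched would be treated as "no text" by a set-prediction loss.
    """
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    if n_gt > n_pred:
        raise ValueError("need at least as many queries as ground truths")
    best_cost, best_pairs = float("inf"), []
    # Brute force: try every ordered choice of n_gt distinct queries.
    for chosen in permutations(range(n_pred), n_gt):
        cost = sum(
            l1_cost(pred_boxes[q], gt_boxes[g]) for g, q in enumerate(chosen)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = sorted((q, g) for g, q in enumerate(chosen))
    return best_cost, best_pairs


# Hypothetical example: three queries, two ground-truth boxes.
queries = [(0.9, 0.9, 1.0, 1.0), (0.1, 0.1, 0.4, 0.2), (0.5, 0.5, 0.6, 0.9)]
ground_truths = [(0.1, 0.1, 0.4, 0.25), (0.5, 0.5, 0.6, 1.0)]
cost, pairs = bipartite_match(queries, ground_truths)
print(pairs)  # [(1, 0), (2, 1)]
```

Because each ground truth is assigned to exactly one query, duplicate predictions are penalized during training, which is why this style of loss can remove the need for NMS at inference time. Practical implementations usually replace the brute-force search with a polynomial-time solver such as the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`.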