STR Transformer: A Cross-domain Transformer for Scene Text Recognition

Wu Xing, Tang Bin, Zhao Ming, Wang Jianjia, Guo Yike
DOI: https://doi.org/10.1007/s10489-022-03728-5
IF: 5.3
2022-01-01
Applied Intelligence
Abstract: Scene text recognition is an indispensable part of computer vision that aims to extract text information from an image. However, effectively extracting text that follows spelling rules remains a challenge for scene text recognition. We propose a cross-domain Transformer, called STR Transformer (STRT), which can not only extract text from an image but also correct characters effectively according to spelling rules. Specifically, we propose a Spline Transformer to extract hierarchical image features without convolution layers; it has the flexibility to build models at various scales and has linear computational complexity with respect to image size. Furthermore, an iterative Text Transformer is designed to predict the probability distribution of the current character in the character sequence, which can effectively reduce the impact of noise. Extensive experiments demonstrate that the proposed STRT outperforms state-of-the-art methods on various benchmark datasets for scene text recognition. Qualitative and quantitative analyses demonstrate the effectiveness and efficiency of the proposed STRT method.
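The claim of linear computational complexity with respect to image size can be made concrete with a small counting exercise. The abstract does not say how the Spline Transformer achieves this, so the sketch below assumes one common approach, restricting self-attention to fixed-size local windows, and compares it with global attention over all patches. The function names, the patch size of 4, and the window size of 8 are hypothetical choices for illustration, not values taken from the paper.

```python
def global_attention_pairs(height, width, patch=4):
    """Token pairs compared by global self-attention over all image patches."""
    tokens = (height // patch) * (width // patch)
    return tokens * tokens


def windowed_attention_pairs(height, width, patch=4, window=8):
    """Token pairs compared when attention is restricted to fixed windows.

    Assumption (not from the paper): each window holds window * window tokens,
    so the cost per window is constant and the total grows with the number of
    windows, that is, linearly with image area.
    """
    rows, cols = height // patch, width // patch
    windows = (rows // window) * (cols // window)
    return windows * (window * window) ** 2


if __name__ == "__main__":
    # Each step quadruples the image area.
    for h, w in [(32, 128), (64, 256), (128, 512)]:
        print(h, w, global_attention_pairs(h, w), windowed_attention_pairs(h, w))
```

Under these assumptions, quadrupling the image area multiplies the global-attention count by 16 but the windowed count only by 4, which is the behavior a linear-complexity design would be expected to show.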