Cascade 2D attentional decoders with context-enhanced encoder for scene text recognition

Hongmei Chi, Jiaxin Cai, Xinran Li
DOI: https://doi.org/10.1007/s00521-024-09493-5
2024-02-21
Neural Computing and Applications
Abstract: The sequence decoding framework has dominated the field of scene text recognition. Within this framework, the RNN-based (recurrent neural network) decoder is one of the main approaches, and the attention mechanism is its key module. In the decoding stage, each character is decoded from an estimated attention map, so the precision of that map is critical to the accuracy of the final output. In practice, we find that the estimated attention map often suffers from attention misalignment. To address this issue, we propose cascade 2D attentional decoders with a context-enhanced encoder for scene text recognition, which we name CASTER. We employ a thin plate spline transformation to rectify original images containing oriented or curved text, and a 31-layer ResNet as the backbone to extract visual features. We then use a two-stage decoding mechanism to predict the character sequence: localization and decoding (coarse decoder), followed by re-localization and re-decoding (refined decoder). We also introduce a novel context-enhanced encoder that uses a 2D contextual fusion module to capture context information. CASTER localizes the attention region of each character more accurately than the one-stage attention method and thus improves the final recognition performance. Extensive experiments show that CASTER achieves state-of-the-art performance on several standard benchmarks. Our method obtains recognition accuracies of 96.1%, 93.3% and 94.4%, respectively, on regular (IIIT5K, SVT) and irregular (CUTE) text datasets.
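The coarse-then-refined idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the refined decoder forms its query, so the dot-product scoring, the function names attend_2d and cascade_step, and the choice to add the coarse glimpse to the decoder state are all assumptions made purely for illustration.

```python
import numpy as np

def attend_2d(features, query):
    """One 2D attention step over a feature map (hypothetical helper).

    features: (H, W, C) visual feature map; query: (C,) vector.
    Returns the (H, W) attention map and the (C,) attended glimpse.
    """
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    scores = flat @ query / np.sqrt(c)   # scaled dot-product score per location
    scores = scores - scores.max()       # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()    # softmax over all H*W locations
    glimpse = weights @ flat             # attention-weighted sum of features
    return weights.reshape(h, w), glimpse

def cascade_step(features, state):
    """Coarse localization, then re-localization with a refined query."""
    coarse_map, coarse_glimpse = attend_2d(features, state)
    # Assumption: the refined query mixes the decoder state with the coarse glimpse.
    refined_map, refined_glimpse = attend_2d(features, state + coarse_glimpse)
    return coarse_map, refined_map, refined_glimpse

# Random stand-ins for a backbone feature map and a decoder hidden state.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 25, 16))
state = rng.normal(size=16)
coarse_map, refined_map, glimpse = cascade_step(features, state)
```

In a full recognizer, the refined glimpse would feed a character classifier at each time step. Here both attention maps are simply returned so they can be compared.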
computer science, artificial intelligence
What problem does this paper attempt to address?