Masked and Permuted Implicit Context Learning for Scene Text Recognition

Xiaomeng Yang, Zhi Qiao, Jin Wei, Dongbao Yang, Yu Zhou
DOI: https://doi.org/10.1109/lsp.2024.3381893
2024-04-05
IEEE Signal Processing Letters
Abstract: Scene Text Recognition (STR) is challenging because of the variety of text styles, shapes, and backgrounds. Although integrating linguistic information improves model performance, existing methods based on either permuted language modeling (PLM) or masked language modeling (MLM) have drawbacks: PLM's autoregressive decoding lacks foresight into subsequent characters, while MLM overlooks inter-character dependencies. To address these problems, we propose a masked and permuted implicit context learning network for STR, which unifies PLM and MLM within a single decoder and inherits the advantages of both approaches. We follow the training procedure of PLM and integrate MLM by incorporating word length information into the decoding process, substituting the undetermined characters with mask tokens. In addition, we employ a perturbation training technique to make the model more robust to potential length prediction errors. In comprehensive evaluations, our model achieves superior performance on commonly used benchmarks and outperforms previous state-of-the-art methods by a substantial 9.1% on the more challenging Union14M-Benchmark.
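The idea of combining a permuted decoding order with mask tokens for undetermined characters can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `[MASK]` string, and the `max_shift` parameter are all hypothetical, and the length perturbation shown here is only one simple guess at what perturbing a predicted length could look like.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration only


def masked_permuted_contexts(word, order):
    """Build the decoder context seen at each step of a permuted decoding order.

    Because the word length is known, the context always has one slot per
    character. Characters decoded in earlier steps are filled in; every
    position not yet determined holds a mask token.
    """
    context = [MASK] * len(word)
    steps = []
    for pos in order:
        steps.append((pos, list(context)))  # context visible when predicting `pos`
        context[pos] = word[pos]            # this position is now determined
    return steps


def perturb_length(length, max_shift=1, rng=random):
    """Assumed perturbation: shift the length by up to `max_shift`, keeping it >= 1."""
    return max(1, length + rng.randint(-max_shift, max_shift))


steps = masked_permuted_contexts("cat", [2, 0, 1])
for pos, ctx in steps:
    print(pos, ctx)
```

With the order `[2, 0, 1]`, the first step predicts position 2 from a fully masked context of length 3, and the last step predicts position 1 while seeing both `c` and `t`, so each prediction has a placeholder for every character that is still undetermined.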
engineering, electrical & electronic