GLaLT: Global-Local Attention-Augmented Light Transformer for Scene Text Recognition

Hui Zhang,Guiyang Luo,Jian Kang,Shan Huang,Xiao Wang,Fei-Yue Wang
DOI: https://doi.org/10.1109/tnnls.2023.3239696
IF: 14.255
2023-01-01
IEEE Transactions on Neural Networks and Learning Systems
Abstract:Recent years have witnessed the growing popularity of connectionist temporal classification (CTC) and attention mechanism in scene text recognition (STR). CTC-based methods consume less time with few computational burdens, while they are not as effective as attention-based methods. To retain computational efficiency and effectiveness, we propose the global-local attention-augmented light Transformer (GLaLT), which adopts a Transformer-based encoder-decoder structure to orchestrate CTC and attention mechanism. The encoder integrates the self-attention module with the convolution module to augment the attention, where the self-attention module pays more attention to capturing long-term global dependencies and the convolution module focuses on local context modeling. The decoder consists of two parallel modules: one is the Transformer-decoder-based attention module and the other is the CTC module. The first one is removed in the testing phase and can guide the second one to extract robust features in the training phase. Extensive experiments on standard benchmarks demonstrate that GLaLT achieves state-of-the-art performance for both regular and irregular STR. In terms of tradeoffs, the proposed GLaLT is at or near the frontiers for maximizing speed, accuracy, and computational efficiency at the same time.
computer science, artificial intelligence, theory & methods,engineering, electrical & electronic, hardware & architecture
What problem does this paper attempt to address?