Abstract: Recently, scene Chinese recognition has attracted increasing attention. While mainstream scene text recognition methods perform very well on English text, they are considerably limited on Chinese text because of inter-class similarity, intra-class variability, and the complex combination of components in scene Chinese text. In this paper, we design Adaptive Position Encoding (APE) to enhance the model's ability to perceive spatial information. Based on APE, we design a Local Attention Module (LAM) and a Global Attention Module (GAM). Specifically, LAM captures local features to identify common characteristics among characters of the same category, addressing the issue of intra-class variability. Meanwhile, GAM captures global features to identify the subordination relationships among Chinese character components. Integrating LAM and GAM combines local and global features, which makes it possible to find differences in detail among features that are fundamentally similar, thus addressing the problem of inter-class similarity. Further, we adopt a transformer encoder-decoder structure to recognize the vast variety of Chinese characters. Based on the Local/Global Attention Modules and the transformer encoder-decoder framework, we devise a novel sequence-to-sequence Local and Global Attention Network (LGANet), in which both the backbone and the encoder/decoder are composed of attention mechanisms. Experiments on the Chinese scene dataset show that the proposed LGANet reaches a recognition accuracy of 77.3% and a normalized edit distance of 88.6%, both of which are state-of-the-art (SOTA) results (Fig. 1).
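The two reported metrics can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so this assumes the definitions commonly used for scene text benchmarks: accuracy is the fraction of exact string matches, and normalized edit distance is one minus the mean Levenshtein distance divided by the longer of the two strings. The function names `edit_distance` and `evaluate` are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def evaluate(preds: Sequence[str], targets: Sequence[str]) -> Tuple[float, float]:
    """Return (accuracy, normalized edit distance) under the assumed definitions."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and of equal length")
    exact = 0
    norm_sum = 0.0
    for pred, target in zip(preds, targets):
        exact += pred == target
        longest = max(len(pred), len(target))
        if longest:  # two empty strings contribute zero distance
            norm_sum += edit_distance(pred, target) / longest
    n = len(preds)
    return exact / n, 1.0 - norm_sum / n
```

For example, `evaluate(["北京大学", "上诲"], ["北京大学", "上海"])` returns `(0.5, 0.75)`: one of the two predictions matches exactly, and the single wrong character in the second prediction costs half of its length. Under these assumed definitions, a higher normalized edit distance score is better, which is consistent with the 88.6% figure being reported as a strength.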

Multi-Scale Channel Attention for Chinese Scene Text Recognition.

Deep Dual-Stream Network with Scale Context Selection Attention Module for Semantic Segmentation

Learning and Fusing Multi-Scale Representations for Accurate Arbitrary-Shaped Scene Text Recognition.

A Multi-Scale Natural Scene Text Detection Method Based on Attention Feature Extraction and Cascade Feature Fusion

MASTER: Multi-Aspect Non-local Network for Scene Text Recognition

Scene Text Recognition Via Gated Cascade Attention

DSRN: A Deep Scale Relationship Network for Scene Text Detection.

Scene Chinese Recognition with Local and Global Attention

Deep Neural Network with Attention Model for Scene Text Recognition.

Scene Text Recognition with Cascade Attention Network.

Scene Text Recognition from Two-Dimensional Perspective

A Text-Context-Aware CNN Network for Multi-oriented and Multi-language Scene Text Detection.

MTSTR: Multi-task learning for low-resolution scene text recognition via dual attention mechanism and its application in logistics industry

Flexible scene text recognition based on dual attention mechanism

Efficient Neural Network for Text Recognition in Natural Scenes Based on End-to-End Multi-Scale Attention Mechanism

Aggregated Text Transformer for Scene Text Detection

A Multi-Level Feature Fusion Network for Scene Text Detection with Text Attention Mechanism

Hierarchical Refined Attention for Scene Text Recognition.

Convolutional Attention Networks for Scene Text Recognition

Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling.

A holistic representation guided attention network for scene text recognition