Improved Fusion of Visual and Semantic Representations by Gated Co-Attention for Scene Text Recognition.

Junwei Zhou, Xi Wang, Jiao Dai, Jizhong Han
DOI: https://doi.org/10.1145/3581807.3581837
2022-01-01
Abstract: Recognizing text in scene photos remains difficult because of the wide variation in how text appears. In recent years, the performance of attention-based text recognition models has improved substantially. However, these models typically focus only on significant image regions, that is, on visual attention. In this paper, we present a new paradigm for scene text recognition, named gated co-attention, in which visual and semantic attention are reasoned about jointly. Given the visual features extracted by a convolutional network and the semantic features extracted by a language model, the first stage combines the two sets of features. The second, gated co-attention stage filters out irrelevant visual features and incorrect semantic information before fusing the knowledge of the two modalities. We evaluate our model on seven datasets; the experimental results show that it performs strongly on all seven and achieves the best results on four.
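The abstract does not give the equations of the gated co-attention stage, so the following is only a minimal sketch of the general idea, not the authors' model. It assumes scaled dot-product attention from visual positions to semantic positions and a sigmoid gate that mixes each visual feature with its attended semantic context. The function name `gated_co_attention`, the parameters `w_gate` and `b_gate`, and all shapes are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_co_attention(visual, semantic, w_gate, b_gate):
    """Fuse visual (T_v x d) and semantic (T_s x d) features.

    Assumption: w_gate (2d x d) and b_gate (length d) stand in for
    learned gate parameters; the paper's actual gate may differ.
    """
    d = len(visual[0])
    # Scaled dot-product affinity between every visual and semantic position.
    scores = matmul(visual, transpose(semantic))
    affinity = [[s / math.sqrt(d) for s in row] for row in scores]
    # Each visual position attends over all semantic positions.
    attention = [softmax(row) for row in affinity]
    context = matmul(attention, semantic)  # T_v x d
    fused = []
    for v, c in zip(visual, context):
        # Per-channel gate in (0, 1), computed from the concatenation [v; c].
        logits = matmul([v + c], w_gate)[0]
        gate = [sigmoid(z + b) for z, b in zip(logits, b_gate)]
        # The gate decides how much visual vs. semantic evidence to keep.
        fused.append([g * vi + (1.0 - g) * ci for g, vi, ci in zip(gate, v, c)])
    return fused

# Toy usage with random features standing in for CNN and language-model outputs.
rng = random.Random(0)
d, t_v, t_s = 4, 3, 5
visual = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(t_v)]
semantic = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(t_s)]
w_gate = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(2 * d)]
b_gate = [0.0] * d
fused = gated_co_attention(visual, semantic, w_gate, b_gate)
print(len(fused), len(fused[0]))  # one fused d-dimensional vector per visual position
```

In this sketch a gate value near 1 keeps the visual channel and a value near 0 replaces it with semantic context, which is one plausible way to suppress unreliable information from either modality. A symmetric pass from semantic to visual positions could be added in the same way.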