Abstract: Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than relying on pure visual classification. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language models with noisy input. Correspondingly, we propose ABINet++, an autonomous, bidirectional and iterative network for scene text spotting. First, the autonomous principle suggests explicitly enforcing language modeling by decoupling the recognizer into a vision model and a language model and blocking gradient flow between the two models. Second, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Third, we propose an execution manner of iterative correction for the language model, which can effectively alleviate the impact of noisy input. Additionally, based on an ensemble of the iterative predictions, a self-training method is developed that can learn from unlabeled images effectively. Finally, to polish ABINet++ for long text recognition, we propose to aggregate horizontal features by embedding Transformer units inside a U-Net, and design a position and content attention module that integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, which consistently demonstrates the superiority of our method in various environments, especially on low-quality images. Besides, extensive experiments in both English and Chinese also show that a text spotter incorporating our language modeling method can significantly improve its performance in both accuracy and speed compared with commonly used attention-based recognizers. Code is available at https://github.com/FangShancheng/ABINet-PP.
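
The iterative correction idea described in the abstract can be illustrated with a small, dependency-free sketch. This is a toy under stated assumptions, not the ABINet++ implementation: the lexicon-voting `toy_cloze_lm` stands in for the bidirectional cloze network, the equal-weight `fuse` stands in for whatever fusion the paper uses, and `LEXICON`, `one_hot_like`, and all other names are hypothetical. In a real deep network the language model input would also be cut off from gradient flow, which plain Python cannot show.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
LEXICON = ["street", "stress", "strong", "secret"]  # hypothetical toy vocabulary


def one_hot_like(text, overrides=None):
    """Build per-position character distributions, optionally overriding some positions."""
    probs = []
    for i, ch in enumerate(text):
        p = {c: 0.0 for c in ALPHABET}
        if overrides and i in overrides:
            p.update(overrides[i])  # simulate an uncertain visual prediction
        else:
            p[ch] = 1.0
        probs.append(p)
    return probs


def decode(probs):
    """Greedy decoding: pick the most likely character at every position."""
    return "".join(max(p, key=p.get) for p in probs)


def toy_cloze_lm(text):
    """Toy stand-in for a cloze language model (assumption, not BCN).

    For each position, that character is masked and re-predicted from both
    its left and right context by letting same-length lexicon words vote.
    """
    out = []
    for i in range(len(text)):
        votes = Counter()
        for word in LEXICON:
            if len(word) != len(text):
                continue
            # Count context agreements, ignoring the masked position i.
            score = sum(a == b for j, (a, b) in enumerate(zip(word, text)) if j != i)
            votes[word[i]] += score ** 2  # squaring favors close matches
        total = sum(votes.values())
        if total == 0:
            out.append({c: 1.0 / len(ALPHABET) for c in ALPHABET})
        else:
            out.append({c: votes[c] / total for c in ALPHABET})
    return out


def fuse(vision, language):
    """Equal-weight average of both predictions (assumed fusion rule)."""
    return [{c: 0.5 * v[c] + 0.5 * l[c] for c in ALPHABET}
            for v, l in zip(vision, language)]


def iterative_correction(vision_probs, num_iters=3):
    """Repeatedly feed the fused prediction back into the language model."""
    current = vision_probs
    history = [decode(current)]
    for _ in range(num_iters):
        language_probs = toy_cloze_lm(decode(current))
        current = fuse(vision_probs, language_probs)
        history.append(decode(current))
    return history


# The vision branch misreads the fourth character of "street" as "c".
vision = one_hot_like("street", {3: {"c": 0.55, "e": 0.45}})
history = iterative_correction(vision)
print(history)
```

Each entry of `history` is the decoded string after one more round, so the list shows how the noisy visual reading is refined as the language prediction is fed back in.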

Efficiently Leveraging Linguistic Priors for Scene Text Spotting

ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting

Hear the Scene: Audio-Enhanced Text Spotting

LATextSpotter: Empowering Transformer Decoder with Length Perception Ability

Linguistic More: Taking a Further Step Toward Efficient and Accurate Scene Text Recognition

AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting

SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition

Towards End-to-End Text Spotting in Natural Scenes

SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting

Arbitrary Reading Order Scene Text Spotter with Local Semantics Guidance

DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training

ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer

An Efficient Scene Text Spotter with a Feature-Level Super-Resolution Module

TDI TextSpotter: Taking Data Imbalance into Account in Scene Text Spotting

Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter

Mask TextSpotter V3: Segmentation Proposal Network for Robust Scene Text Spotting

Scene Text Retrieval Via Joint Text Detection and Similarity Learning

A Cost-Efficient Framework for Scene Text Detection in the Wild

SPTS v2: Single-Point Scene Text Spotting