Abstract: The exploration of linguistic information has promoted the development of scene text recognition. Benefiting from its strength in parallel reasoning and global relationship capture, the transformer-based language model (TLM) has recently achieved dominant performance. Because the TLM is decoupled from the recognition process, we argue that its capability is limited by the low-quality visual prediction it receives as input. Specifically: 1) A visual prediction with low character-wise accuracy increases the correction burden of the TLM. 2) An inconsistent word length between the visual prediction and the original image provides wrong language modeling guidance to the TLM. In this paper, we propose a Progressive scEne Text Recognizer (PETR) to improve the capability of the transformer-based language model by handling the above two problems. Firstly, a Destruction Learning Module (DLM) is proposed to consider the linguistic information in the visual context. DLM introduces the recognition of destructed images with disordered patches in the training stage. By guiding the vision model to restore the patch order and make word-level predictions on the destructed images, DLM explores the inner relationships between local visual patches and yields visual predictions with high character-wise accuracy. Secondly, a new Language Rectification Module (LRM) is proposed to optimize the word length and thereby rectify the language guidance. By progressively applying LRM at different language modeling steps, a novel progressive rectification network is constructed to handle extremely challenging cases (e.g., distortion and occlusion). By utilizing DLM and LRM, PETR enhances the capability of the transformer-based language model from a more general perspective, namely by reducing the correction burden and rectifying the language modeling guidance.
Compared with parallel transformer-based methods, PETR obtains 1.0% and 0.8% improvement on regular and irregular datasets, respectively, while introducing only 1.7M additional parameters. Extensive experiments on both English and Chinese benchmarks demonstrate that PETR achieves state-of-the-art results.
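
The patch disordering that DLM applies to training images can be illustrated with a minimal sketch. The abstract does not specify the patch shape, the number of patches, or how the restoration target is encoded, so everything below is an assumption made for illustration: the image is cut into equal-width vertical strips, the strips are shuffled, and the permutation is kept as the order a model would be asked to restore. The names `destruct_image` and `restore_image` are hypothetical and are not taken from the paper.

```python
import random


def destruct_image(image, num_patches, rng):
    """Cut an image into equal-width vertical strips and shuffle them.

    `image` is a list of rows, each row a list of pixel values.
    Returns the shuffled image and `perm`, where perm[i] is the original
    index of the strip that now sits at position i. (Assumed layout: the
    source does not state how patches are formed.)
    """
    width = len(image[0])
    if width % num_patches != 0:
        raise ValueError("image width must be divisible by num_patches")
    patch_w = width // num_patches

    perm = list(range(num_patches))
    rng.shuffle(perm)

    shuffled = [
        [px for src in perm for px in row[src * patch_w:(src + 1) * patch_w]]
        for row in image
    ]
    return shuffled, perm


def restore_image(shuffled, perm):
    """Undo destruct_image given the permutation it returned."""
    num_patches = len(perm)
    patch_w = len(shuffled[0]) // num_patches

    # inverse[j] is the position in the shuffled image holding original strip j.
    inverse = [0] * num_patches
    for pos, src in enumerate(perm):
        inverse[src] = pos

    return [
        [px for pos in inverse for px in row[pos * patch_w:(pos + 1) * patch_w]]
        for row in shuffled
    ]
```

In this sketch the permutation plays the role of the patch-order label: a vision model trained on the shuffled image would predict it alongside the word, while `restore_image` only confirms that the shuffle is invertible.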

STR Transformer: A Cross-domain Transformer for Scene Text Recognition

Pure Transformer with Integrated Experts for Scene Text Recognition

ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition

TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance

SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor

Multi-Granularity Prediction for Scene Text Recognition

Aggregated Text Transformer for Scene Text Detection

Scene Text Recognition with Cascade Attention Network

SVTR: Scene Text Recognition with a Single Visual Model

Transforming Scene Text Detection and Recognition: A Multi-Scale End-to-End Approach With Transformer Framework

Flexible scene text recognition based on dual attention mechanism

Exploring Font-independent Features for Scene Text Recognition

Visual-Semantic Transformer for Scene Text Recognition

Scene Text Telescope: Text-Focused Scene Image Super-Resolution

STIRER: A Unified Model for Low-Resolution Scene Text Image Recovery and Recognition

PETR: Rethinking the Capability of Transformer-Based Language Model in Scene Text Recognition

OTE: Exploring Accurate Scene Text Recognition Using One Token

Batch-transformer for scene text image super-resolution

CSTR: A Classification Perspective on Scene Text Recognition

CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

Instruction-Guided Scene Text Recognition