SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

Yongkun Du, Zhineng Chen, Hongtao Xie, Caiyan Jia, Yu-Gang Jiang

2024-11-24

Abstract: Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned linear classifier, and therefore fast inference. However, they generally have worse accuracy than encoder-decoder-based methods (EDTRs), particularly in challenging scenarios. In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed. SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context, which endows it with the capability to deal with challenging and diverse text instances. First, a multi-size resizing (MSR) strategy is proposed to adaptively resize the text and maintain its readability. Meanwhile, we introduce a feature rearrangement module (FRM) to ensure that visual features accommodate the alignment requirement of CTC well, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module (SGM). It integrates linguistic context into the visual model, allowing it to leverage language information for improved accuracy. Moreover, SGM can be omitted at the inference stage and would not increase the inference cost. We evaluate SVTRv2 on both standard and recent challenging benchmarks, where SVTRv2 is fairly compared with 24 mainstream STR models across multiple scenarios, including different types of text irregularity, languages, and long text. The results indicate that SVTRv2 surpasses all the EDTRs across the scenarios in terms of accuracy and speed. Code is available at https://github.com/Topdu/OpenOCR.
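
The abstract attributes the speed of CTC models to their simple pipeline: a visual model followed by a CTC-aligned linear classifier. As a rough illustration of what happens after that classifier, the sketch below shows standard greedy (best-path) CTC decoding in plain Python. It is a generic textbook procedure, not code from the SVTRv2 repository; the function name `ctc_greedy_decode`, the toy vocabulary, and the choice of index 0 for the blank symbol are all assumptions made for the example.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 of the vocabulary is the CTC blank symbol


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Greedy (best-path) CTC decoding.

    frame_scores: one list of class scores per horizontal frame.
    vocab: list mapping class index to character; vocab[blank] is the blank.
    """
    # 1. Pick the highest-scoring class independently for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Merge consecutive repeats, then 3. drop the blank symbol.
    collapsed = [idx for idx, _ in groupby(best_path) if idx != blank]
    return "".join(vocab[idx] for idx in collapsed)


# Toy usage: one-hot scores for the frame sequence "hh-ell-loo".
vocab = ["-", "h", "e", "l", "o"]
path = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
scores = [[1.0 if i == p else 0.0 for i in range(len(vocab))] for p in path]
print(ctc_greedy_decode(scores, vocab))
```

Each frame is classified on its own and no autoregressive decoder is involved, which is consistent with the fast inference described above. The blank between the two `l` frames is what lets the double letter survive the merge step.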

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper attempts to solve two main problems in Scene Text Recognition (STR):

1. **Dealing with text irregularities**: Existing methods based on Connectionist Temporal Classification (CTC) handle irregular text poorly, for example distorted text or text with unusual layouts. As a result, CTC models have lower recognition accuracy in complex scenes.
2. **Integrating language context**: CTC models usually do not encode language context, in contrast to encoder-decoder-based methods (EDTRs). EDTRs integrate multiple cues, such as visual, linguistic, and positional information, through the attention mechanism, and therefore perform better in complex scenes.

To solve these problems, the paper proposes SVTRv2, an improved CTC model. SVTRv2 introduces the following innovations:

- **Multi-Size Resizing (MSR)**: Adaptively resize the text image according to its aspect ratio, reducing the distortion caused by resizing every image to one fixed size and improving the quality of the extracted visual features.
- **Feature Rearrangement Module (FRM)**: Rearrange visual features horizontally and vertically so that they better meet the alignment requirement of CTC, thereby alleviating the CTC alignment problem and improving recognition of irregular text.
- **Semantic Guidance Module (SGM)**: Introduce language context during training to guide the visual model to perceive linguistic context. Because SGM is omitted at inference, it adds no inference cost. This enables the CTC model to use language information and improve recognition accuracy.

With these improvements, SVTRv2 not only exceeds existing EDTRs in recognition accuracy but also performs well in inference speed, addressing the trade-off among accuracy, speed, and generality in current STR models.
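
The idea behind multi-size resizing can be sketched as choosing a target size from the image's aspect ratio instead of forcing every image into one fixed shape. The code below is a minimal illustration under stated assumptions: the ratio thresholds, the target sizes in `ASSUMED_BUCKETS`, the fixed size `(128, 32)` used for comparison, and the helper names are hypothetical values chosen for the example, not figures taken from the text above.

```python
# Hypothetical (max_ratio, (target_width, target_height)) buckets.
# These numbers are illustrative assumptions, not values from the summary.
ASSUMED_BUCKETS = (
    (1.5, (64, 64)),
    (2.5, (96, 48)),
    (3.5, (128, 40)),
)
WIDE_HEIGHT = 32  # assumed height used for very wide text


def choose_target_size(width, height):
    """Pick a (width, height) resize target from the image's aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    for max_ratio, size in ASSUMED_BUCKETS:
        if ratio < max_ratio:
            return size
    # Very wide text: keep a fixed height and let the width grow with the ratio.
    return (WIDE_HEIGHT * round(ratio), WIDE_HEIGHT)


def distortion(width, height, target):
    """Ratio of the target aspect ratio to the original one (1.0 = undistorted)."""
    target_w, target_h = target
    return (target_w / target_h) / (width / height)


# A square crop stretched to an assumed fixed 128x32 input is distorted 4x,
# while the ratio-aware choice leaves it undistorted.
print(distortion(50, 50, (128, 32)))
print(distortion(50, 50, choose_target_size(50, 50)))
```

The point of the comparison is the one made in the MSR bullet above: a single fixed size stretches or squeezes images whose aspect ratio differs from it, whereas a ratio-dependent target keeps the characters closer to their original proportions.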

SVTR: Scene Text Recognition with a Single Visual Model

SVTR-SRNet: A Deep Learning Model for Scene Text Recognition via SVTR Framework and Spatial Reduction Mechanism

SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor

TSO-DETR: A Network for Small Object Detection of Cervical Cells in TCT Smear

ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition

CSTR: A Classification Perspective on Scene Text Recognition

Instruction-Guided Scene Text Recognition

Decoder Pre-Training with only Text for Scene Text Recognition

2D-CTC for Scene Text Recognition

CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

Revisiting Classification Perspective on Scene Text Recognition

ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining

STIRER: A Unified Model for Low-Resolution Scene Text Image Recovery and Recognition

Character Region Awareness Network for Scene Text Recognition

TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance

Multi-Granularity Prediction for Scene Text Recognition

OTE: Exploring Accurate Scene Text Recognition Using One Token

Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition

Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing

Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition