SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

Yongkun Du, Zhineng Chen, Hongtao Xie, Caiyan Jia, Yu-Gang Jiang
2024-11-24
Abstract: Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned linear classifier, and therefore fast inference. However, they generally have worse accuracy than encoder-decoder-based methods (EDTRs), particularly in challenging scenarios. In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed. SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context, which endows it with the capability to deal with challenging and diverse text instances. First, a multi-size resizing (MSR) strategy is proposed to adaptively resize the text and maintain its readability. Meanwhile, we introduce a feature rearrangement module (FRM) to ensure that visual features accommodate the alignment requirement of CTC well, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module (SGM). It integrates linguistic context into the visual model, allowing it to leverage language information for improved accuracy. Moreover, SGM can be omitted at the inference stage and would not increase the inference cost. We evaluate SVTRv2 in both standard and recent challenging benchmarks, where SVTRv2 is fairly compared with 24 mainstream STR models across multiple scenarios, including different types of text irregularity, languages, and long text. The results indicate that SVTRv2 surpasses all the EDTRs across the scenarios in terms of accuracy and speed. Code is available at https://github.com/Topdu/OpenOCR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve two main problems in Scene Text Recognition (STR):

1. **Dealing with text irregularities**: Existing methods based on Connectionist Temporal Classification (CTC) handle irregular text poorly, for example distorted text or unusual layouts. This lowers the recognition accuracy of CTC models in complex scenes.
2. **Integrating language context**: CTC models usually do not encode linguistic context, in contrast to encoder-decoder-based methods (EDTRs). EDTRs integrate multi-modal cues such as visual, linguistic, and positional information through the attention mechanism, and therefore perform better in complex scenes.

To solve these problems, the paper proposes SVTRv2, an improved CTC model. SVTRv2 introduces the following innovations:

- **Multi-Size Resizing (MSR)**: Adaptively resizes the text image according to its aspect ratio, which reduces the distortion caused by resizing every image to one fixed size and improves the quality of the extracted visual features.
- **Feature Rearrangement Module (FRM)**: Rearranges visual features horizontally and vertically so that they better meet the alignment requirement of CTC, thereby alleviating the CTC alignment problem and improving recognition of irregular text.
- **Semantic Guidance Module (SGM)**: Introduces linguistic context during training to guide the visual model to perceive that context. Because SGM can be omitted at inference, it does not increase the inference cost. This lets the CTC model use language information to improve recognition accuracy.

Through these improvements, SVTRv2 not only exceeds existing EDTRs in recognition accuracy but also in inference speed, thus addressing the trade-off among accuracy, speed, and generality in current STR models.
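
To make the MSR idea concrete, here is a minimal sketch of choosing a resize target from the aspect ratio of a text crop. The function name `choose_resize_shape`, the ratio thresholds, the target sizes, and the `max_width` cap are all illustrative assumptions. None of them are given in the text above, and they should not be read as the paper's exact configuration.

```python
def choose_resize_shape(width, height, base_height=32, max_width=512):
    """Pick a (new_width, new_height) target from the crop's aspect ratio.

    Hypothetical sketch of aspect-ratio-aware resizing; thresholds and
    sizes are assumptions chosen only for illustration.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    ratio = width / height

    # Assumed buckets: (exclusive upper bound on ratio, target shape).
    # Near-square crops keep a near-square target instead of being
    # stretched into a long, thin strip.
    buckets = [
        (1.5, (64, 64)),
        (2.5, (96, 48)),
        (3.5, (128, 32)),
    ]
    for upper_bound, shape in buckets:
        if ratio < upper_bound:
            return shape

    # Long text: fix the height and let the width follow the aspect
    # ratio, capped so very long lines stay bounded.
    new_width = min(max_width, int(round(base_height * ratio)))
    return (new_width, base_height)


if __name__ == "__main__":
    for w, h in [(50, 50), (100, 50), (300, 100), (320, 32)]:
        print((w, h), "->", choose_resize_shape(w, h))
```

A fixed-size pipeline would map all four example crops to the same shape. The bucketed version keeps each target close to the crop's original proportions, which is the distortion-reducing effect the MSR bullet describes.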