Abstract: Scene Text Recognition (STR), which involves recognizing text within images of natural scenes, is an important and challenging upstream task for building structured information databases. Although current state-of-the-art (SOTA) models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures composed of visual encoders and sequence decoders. In this work, we propose a VIsion Permutable extractor for fast and efficient Scene Text Recognition (SVIPTR), which achieves an impressive balance between high performance and rapid inference speed in the domain of STR. Specifically, SVIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by the permutation and combination of local and global self-attention layers. This design results in a lightweight and efficient model whose inference is insensitive to input length. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of SVIPTR. Notably, the SVIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, the SVIPTR-L (Large) variant attains SOTA accuracy among single-encoder-type models while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge, which greatly benefits real-world applications requiring fast and efficient STR. The code is publicly available at https://github.com/cxfyxl/VIPTR.

What problem does this paper attempt to address?

This paper attempts to solve several key problems in Scene Text Recognition (STR):

1. **Improving inference efficiency while maintaining accuracy**: Although current state-of-the-art (SOTA) STR models achieve high performance, they usually rely on a hybrid architecture consisting of a visual encoder and a sequence decoder, which results in low inference efficiency. The paper proposes a new method, SVIPTR (VIsion Permutable Extractor for Scene Text Recognition), that aims to significantly improve inference speed while maintaining high accuracy.
2. **Making the model lightweight and efficient**: Existing STR models are often difficult to deploy on resource-limited devices because of their complex structures. SVIPTR keeps the model lightweight and efficient by using a vision-semantic extractor with a pyramid structure that combines permutations of local and global self-attention layers. This design makes the inference speed of SVIPTR insensitive to input length, so the model suits text recognition tasks of different lengths.
3. **Supporting cross-language recognition**: SVIPTR can accurately recognize English scene text and also effectively process Chinese scene text, showing its potential in multilingual environments.

Specifically, the main contributions of SVIPTR include:

- **Verifying that different permutations and combinations of sparse operators and self-attention mechanisms can accelerate the computation of Vision Transformers.** Models built on these mechanisms can still achieve accuracy comparable to that of advanced vision-language models on STR tasks, balancing performance and speed.
- **Proposing SVIPTR, a vision-semantic feature extraction model specifically designed for parsing text in images.** This model can accurately recognize cross-language image-text inputs and is insensitive to the length of the input image.
- **Verifying the superiority of SVIPTR on cross-language benchmark datasets and manually annotated industrial application datasets.** SVIPTRv1-L exceeds the accuracy of other ViT-based encoder models in both Chinese and English STR; SVIPTRv2-T achieves the most efficient inference while preserving accuracy, with 5.1M parameters and an average inference time of only 3.3 ms per text image on an NVIDIA V100 GPU.

Through these contributions, SVIPTR provides a strong solution to STR challenges and benefits practical applications that require fast and efficient STR.
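
The claim that permuting local and global self-attention layers can speed up computation can be illustrated with a rough counting sketch. The code below is not the SVIPTR implementation: it assumes a 1-D token sequence, a hypothetical window size of 7 for local attention, and uses the number of query-key pairs scored per layer as a stand-in for cost. The names `attended_pairs`, `layout_cost`, and `all_layouts` are made up for this illustration.

```python
from itertools import product


def attended_pairs(num_tokens, kind, window=7):
    """Count query-key pairs scored by one attention layer (assumed cost proxy)."""
    if kind == "G":
        # Global attention: every token attends to every token.
        return num_tokens * num_tokens
    if kind == "L":
        # Local attention: each token attends only to neighbours inside the window,
        # clipped at the sequence boundaries.
        half = window // 2
        return sum(
            min(num_tokens - 1, i + half) - max(0, i - half) + 1
            for i in range(num_tokens)
        )
    raise ValueError(f"unknown layer kind: {kind!r}")


def layout_cost(layout, num_tokens, window=7):
    """Total pair count for a layer layout such as 'LLG' (L = local, G = global)."""
    return sum(attended_pairs(num_tokens, kind, window) for kind in layout)


def all_layouts(depth):
    """Enumerate every ordering of local/global layers for a given depth."""
    return ["".join(p) for p in product("LG", repeat=depth)]


if __name__ == "__main__":
    # Compare every 3-layer layout on a hypothetical 64-token sequence.
    for layout in all_layouts(3):
        print(layout, layout_cost(layout, num_tokens=64))
```

Under this simplified proxy, each local layer grows roughly linearly with the number of tokens while each global layer grows quadratically, so layouts with more local layers score fewer pairs. Note that the proxy depends only on how many layers of each kind a layout contains, so it does not capture how the ordering of layers affects accuracy, which is what the paper evaluates empirically.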

SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor

SVTR: Scene Text Recognition with a Single Visual Model

Scene Text Detection and Recognition System for Visually Impaired People in Real World

SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

SVTR-SRNet: A Deep Learning Model for Scene Text Recognition via SVTR Framework and Spatial Reduction Mechanism

CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition

ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining

OTE: Exploring Accurate Scene Text Recognition Using One Token

Multi-Granularity Prediction for Scene Text Recognition

Flexible scene text recognition based on dual attention mechanism

Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition

IterVM: Iterative Vision Modeling Module for Scene Text Recognition

Instruction-Guided Scene Text Recognition

STIRER: A Unified Model for Low-Resolution Scene Text Image Recovery and Recognition

Multimodal Visual-Semantic Representations Learning for Scene Text Recognition

CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model

Efficient Backbone Search for Scene Text Recognition

ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval