SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor

Xianfu Cheng, Weixiao Zhou, Xiang Li, Jian Yang, Hang Zhang, Tao Sun, Wei Zhang, Yuying Mai, Tongliang Li, Xiaoming Chen, Zhoujun Li
2024-08-20
Abstract: Scene Text Recognition (STR) is an important and challenging upstream task for building structured information databases; it involves recognizing text within images of natural scenes. Although current state-of-the-art (SOTA) STR models achieve high accuracy, they typically suffer from low inference efficiency because they rely on hybrid architectures composed of visual encoders and sequence decoders. In this work, we propose a VIsion Permutable extractor for fast and efficient Scene Text Recognition (SVIPTR), which balances high accuracy against rapid inference. Specifically, SVIPTR uses a visual-semantic extractor with a pyramid structure, characterized by the permutation and combination of local and global self-attention layers. This design yields a lightweight and efficient model whose inference is insensitive to input length. Extensive experiments on standard datasets for both Chinese and English scene text recognition validate the superiority of SVIPTR. Notably, the SVIPTR-T (Tiny) variant delivers accuracy on par with other lightweight models and achieves SOTA inference speed. Meanwhile, SVIPTR-L (Large) attains SOTA accuracy among single-encoder-type models while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge and benefits real-world applications that require fast and efficient STR. The code is publicly available at https://github.com/cxfyxl/VIPTR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve several key problems in Scene Text Recognition (STR):

1. **Improving inference speed while maintaining accuracy**: Current state-of-the-art (SOTA) STR models perform well, but they usually rely on a hybrid architecture consisting of a visual encoder and a sequence decoder, which makes inference slow. The paper proposes a new method, SVIPTR (VIsion Permutable Extractor for Scene Text Recognition), that aims to significantly improve inference speed while maintaining high accuracy.
2. **Making the model lightweight and efficient**: Existing STR models are often hard to deploy on resource-limited devices because of their complex structures. SVIPTR keeps the model lightweight and efficient by using a vision-semantic extractor with a pyramid structure that permutes and combines local and global self-attention layers. This design makes SVIPTR's inference speed insensitive to input length, so it suits text recognition tasks of different lengths.
3. **Supporting cross-language recognition**: SVIPTR accurately recognizes English scene text and also handles Chinese scene text effectively, which shows its potential in multilingual settings.

Specifically, the main contributions of SVIPTR include:

- **Verifying that different permutations and combinations of sparse operators and self-attention mechanisms can accelerate Vision Transformer computation**. A model built on these mechanisms can still reach accuracy comparable to that of advanced vision-language models on STR tasks, balancing accuracy against speed.
- **Proposing SVIPTR, a vision-semantic feature extraction model designed specifically for parsing text in images**. This model accurately recognizes cross-language image-text inputs and is insensitive to the length of the input image.
- **Verifying the superiority of SVIPTR on cross-language benchmark datasets and manually annotated industrial application datasets**. SVIPTRv1-L exceeds the accuracy of other ViT-based encoder models in both Chinese and English STR. SVIPTRv2-T achieves the most efficient inference while preserving accuracy, with 5.1M parameters and an average inference time of only 3.3 ms per text image on an NVIDIA V100 GPU.

Through these contributions, SVIPTR provides a strong solution to STR challenges and supports practical applications that require fast and efficient STR.
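To make the cost argument behind mixing local and global self-attention concrete, the following is a minimal sketch, not the authors' implementation. It counts how many (query, key) pairs a layer must score when attention is global versus restricted to a local window, and enumerates the distinct orderings of a small stack. The function names, the `window` parameter, and the `"L"`/`"G"` encoding are all hypothetical, chosen only for illustration; the sketch uses a flat 1-D token sequence and ignores the pyramid downsampling described in the summary.

```python
from itertools import permutations


def attended_pairs(num_tokens, window=None):
    """Count the (query, key) pairs scored by one self-attention layer.

    window=None -> global attention: every token attends to every token.
    window=w    -> local attention: token i attends to j with |i - j| <= w // 2.
    (Hypothetical helper; a 1-D simplification of windowed attention.)
    """
    if window is None:
        return num_tokens * num_tokens
    half = window // 2
    total = 0
    for i in range(num_tokens):
        lo = max(0, i - half)               # clip the window at the left edge
        hi = min(num_tokens - 1, i + half)  # clip the window at the right edge
        total += hi - lo + 1
    return total


def stack_cost(layer_order, num_tokens, window):
    """Total scored pairs for a stack written as a string of 'L' and 'G'."""
    cost = 0
    for kind in layer_order:
        if kind == "L":
            cost += attended_pairs(num_tokens, window)
        elif kind == "G":
            cost += attended_pairs(num_tokens)
        else:
            raise ValueError(f"unknown layer kind: {kind!r}")
    return cost


def distinct_orderings(num_local, num_global):
    """All distinct orderings of a fixed set of local and global layers."""
    layers = "L" * num_local + "G" * num_global
    return sorted({"".join(p) for p in permutations(layers)})


if __name__ == "__main__":
    n, w = 8, 3
    print("all-global stack:", stack_cost("GGG", n, w))
    for order in distinct_orderings(2, 1):
        print(order, stack_cost(order, n, w))
```

In this flat sketch the total cost depends only on how many layers of each kind are used, not on their order. In a pyramid structure the token count changes between stages, so where the global layers are placed would also affect cost; that effect is not modeled here.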