Out of Length Text Recognition with Sub-String Matching

Yongkun Du,Zhineng Chen,Caiyan Jia,Xieping Gao,Yu-Gang Jiang

2024-08-13

Abstract: Scene Text Recognition (STR) methods have demonstrated robust performance in word-level text recognition. However, in real applications the text image is sometimes long because the detected region contains multiple horizontally arranged words. This creates a need to build long text recognition models from readily available short (i.e., word-level) text datasets, a setting that has been little studied. In this paper, we term this task Out of Length (OOL) text recognition. We establish the first Long Text Benchmark (LTB) to facilitate the assessment of different methods in long text recognition. We also propose a novel method called OOL Text Recognition with sub-String Matching (SMTR). SMTR comprises two cross-attention-based modules: one encodes a sub-string containing multiple characters into next and previous queries, and the other uses these queries to attend to the image features, matching the sub-string and simultaneously recognizing its next and previous characters. SMTR can recognize text of arbitrary length by iterating this process. To avoid being trapped when recognizing highly similar sub-strings, we introduce a regularization training scheme that compels SMTR to discover subtle differences between similar sub-strings for precise matching. In addition, we propose an inference augmentation strategy to alleviate the confusion caused by identical sub-strings in the same text and to improve overall recognition efficiency. Extensive experimental results show that SMTR, even when trained exclusively on short text, outperforms existing methods on public short text benchmarks and exhibits a clear advantage on LTB. Code: https://github.com/Topdu/OpenOCR.
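The iterative matching loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_match` is a hypothetical stand-in that looks up a sub-string in a known string instead of attending to image features, and the boundary symbols `^` and `$`, the sub-string length `k`, and the first-match rule are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

BOS, EOS = "^", "$"  # hypothetical boundary symbols, not taken from the paper


def toy_match(hidden: str, sub: List[str]) -> Tuple[Optional[str], Optional[str]]:
    """Toy stand-in for the matching step: find the first occurrence of
    `sub` in the padded hidden text and return (previous, next) symbols."""
    k = len(sub)
    padded = [BOS] * k + list(hidden) + [EOS] * k
    for i in range(len(padded) - k + 1):
        if padded[i:i + k] == sub:
            prev_sym = padded[i - 1] if i > 0 else None
            next_sym = padded[i + k] if i + k < len(padded) else None
            return prev_sym, next_sym
    return None, None  # sub-string not found


def decode_forward(hidden: str, k: int = 3, max_steps: int = 1000) -> str:
    """Recover the text left to right by repeatedly matching the last k
    symbols and appending the predicted next symbol."""
    sub = [BOS] * k  # assumed starting sub-string: k boundary symbols
    out: List[str] = []
    for _ in range(max_steps):  # guard against loops on repeated sub-strings
        _, nxt = toy_match(hidden, sub)
        if nxt is None or nxt == EOS:
            break
        out.append(nxt)
        sub = (sub + [nxt])[-k:]  # slide the sub-string window by one symbol
    return "".join(out)


if __name__ == "__main__":
    print(decode_forward("out of length text", k=3))
    print(decode_forward("banana", k=4))
```

Because each step only needs a short sub-string, the loop length is not tied to the text lengths seen in training, which is the property the abstract relies on. In this toy, a text with repeated sub-strings (for example `"banana"` with `k=2`) makes the first-match rule cycle until `max_steps` is reached, which mirrors the confusion from identical sub-strings that the abstract mentions. The paper's own inference augmentation strategy is not reproduced here.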

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the problem of recognizing long text in natural scenes. Existing Scene Text Recognition (STR) methods perform well on word-level text, but in practical applications a detected text image sometimes contains multiple horizontally arranged words that form a longer line of text, so models capable of handling long text are needed. Most current STR datasets, however, contain mainly short, word-level text and offer little long text. How to train a model on short text only so that it can still accurately recognize long text is therefore a challenging research question. The paper refers to this task as Out of Length (OOL) text recognition and proposes a corresponding solution.

The main contributions of the paper include:

1. **Defining the Problem**: The paper is the first to explicitly state the need to build long text recognition models from short text datasets, and names this the OOL text recognition challenge.
2. **Establishing a Benchmark**: The first Long Text Benchmark (LTB) is established to evaluate the performance of different methods on long text recognition.
3. **Proposing a New Method**: A new method named OOL Text Recognition with sub-String Matching (SMTR) is proposed, which addresses the OOL challenge by matching sub-strings and recognizing their next and previous characters.
4. **Optimization Strategies**: Regularization training and an inference augmentation strategy are introduced to improve the accuracy and efficiency of the model when dealing with similar and repeated sub-strings.

Through these contributions, the paper extends research in the STR field and provides a way to handle long text in real-world application scenarios.
