Abstract: In text recognition, self-supervised pre-training emerges as a good solution to reduce dependence on expensive annotated real data. Previous studies primarily focus on local visual representation by leveraging masked image modeling or sequence contrastive learning. However, they omit modeling the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct the direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image with its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the feature of the same original image and inverted image with different augmentations to model the semantic-level linguistic context and the local character discrimination. In our design, we disrupt the character shape and linguistic rules. Consequently, the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and 86.6% new state-of-the-art average word accuracy on Union14M benchmarks. The code is available at

What problem does this paper attempt to address?

The paper addresses a gap in self-supervised pre-training for scene text recognition: existing methods concentrate on the local visual representation of characters and ignore the linguistic information carried by text images. As a result, the pre-trained models understand text less well than they could, which matters most for multilingual text and when labeled data is scarce. To close this gap, the authors propose Symmetric Superimposition Modeling (SSM), which captures local character features and linguistic information simultaneously and thereby improves recognition performance.

### Main contributions:

1. **A new pre-training framework**: SSM is described as the first self-supervised scene text recognition method that focuses on language learning in the visual space.
2. **A dual-level reconstruction design**: Joint reconstruction at the pixel level and the feature level learns the visual features and the implicit linguistic information of characters simultaneously, further improving the quality of the representation.
3. **Experimental verification**: Across multiple text recognition benchmarks, SSM achieves state-of-the-art performance, with an average improvement of 4.1% and a new best average word accuracy of 86.6% on the Union14M benchmarks. In multilingual text recognition, SSM outperforms other self-supervised methods by 15.5% and 1.5% in the individual training and joint training settings, respectively.

### Method overview:

- **Symmetric superimposed input construction**: Generate symmetrically superimposed inputs by combining the original image with its horizontally flipped, vertically flipped, or 180-degree rotated view, which ensures a large character overlap area.
- **Pixel-level image reconstruction**: Use an encoder-regressor-decoder architecture to symmetrically reconstruct the pixels of the original and flipped images according to the orientation index.
- **Feature-level representation reconstruction**: Reconstruct features in a high-dimensional space through a projection module and a regressor to enhance the discriminability of character semantics and spatial context modeling.
- **Downstream tasks**: Add a text decoder on top of the pre-trained model for final character prediction.

### Experimental results:

- **Common benchmarks**: SSM performs well across multiple benchmarks, especially on the IIIT and IC13 datasets, where it raises average accuracy by 13.6% and 9.4%, respectively, compared to SeqCLR.
- **Union14M benchmarks**: Even against the strong baseline Scratch-ViT-Small, SSM improves performance by an average of 3.5%.
- **Multilingual text recognition**: On the MLT19 benchmark, SSM's average accuracy is 15.5% and 1.5% higher than DiG in the individual training and joint training settings, respectively. The improvement is particularly large for right-to-left Arabic.

Through these contributions, the paper demonstrates the effectiveness and generalization ability of SSM in scene text recognition, especially its advantages for multilingual text and data-scarce situations.
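
The input construction and direction-indexed pixel targets from the method overview can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the equal blending weight `alpha=0.5`, the 0/1 direction-index convention, and the use of mean squared error are all assumptions made for illustration, and images are represented as plain 2D lists of grayscale values.

```python
# Hypothetical sketch of symmetric superimposition (not the authors' code).
# An image is a 2D list of grayscale values in [0, 1].

def hflip(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Reverse the order of the rows."""
    return [row[:] for row in img[::-1]]

def rot180(img):
    """Rotate by 180 degrees (vertical flip followed by horizontal flip)."""
    return hflip(vflip(img))

INVERSIONS = {"hflip": hflip, "vflip": vflip, "rot180": rot180}

def superimpose(img, mode, alpha=0.5):
    """Blend an image with its inverted view.

    alpha is an assumed blending weight; the summary only states that
    the original image and its inverted view are added together.
    """
    inverted = INVERSIONS[mode](img)
    mixed = [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img, inverted)
    ]
    return mixed, inverted

def reconstruction_target(img, inverted, direction):
    """Pick the pixel target by direction index.

    Assumed convention: 0 selects the original view, 1 the inverted view.
    """
    return img if direction == 0 else inverted

def mse(pred, target):
    """Mean squared error between two images of equal size (assumed loss)."""
    diffs = [
        (p - t) ** 2
        for row_p, row_t in zip(pred, target)
        for p, t in zip(row_p, row_t)
    ]
    return sum(diffs) / len(diffs)

img = [[0.0, 0.2, 1.0],
       [0.4, 0.6, 0.8]]
mixed, inverted = superimpose(img, "hflip")

# With equal weights the superimposed input is unchanged by the same flip,
# so the direction index is what tells the model which view to reconstruct.
is_symmetric = hflip(mixed) == mixed

# A perfect reconstruction of the inverted view has zero pixel loss.
loss = mse(inverted, reconstruction_target(img, inverted, direction=1))
```

Because the equally weighted blend is invariant under the flip that produced it, the input alone does not indicate which view to recover. That is consistent with the summary's statement that reconstruction is conditioned on an orientation index.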

Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition