Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

Zuan Gao, Yuxin Wang, Yadong Qu, Boqiang Zhang, Zixiao Wang, Jianjun Xu, Hongtao Xie
2024-05-11
Abstract: In text recognition, self-supervised pre-training emerges as a good solution to reduce dependence on expansive annotated real data. Previous studies primarily focus on local visual representation by leveraging mask image modeling or sequence contrastive learning. However, they omit modeling the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct the direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image with its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the feature of the same original image and inverted image with different augmentations to model the semantic-level linguistic context and the local character discrimination. In our design, we disrupt the character shape and linguistic rules. Consequently, the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and 86.6% new state-of-the-art average word accuracy on Union14M benchmarks. The code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is that existing self-supervised pre-training methods for scene text recognition focus mainly on local visual representations of characters and ignore the linguistic information in text images. This limits the model's text understanding in practice, especially on multilingual text or when labeled data is scarce. To address this, the authors propose Symmetric Superimposition Modeling (SSM), which captures local character features and linguistic information simultaneously and thereby improves text recognition performance.

### Main contributions:

1. **A new pre-training framework**: Symmetric Superimposition Modeling (SSM) is the first self-supervised scene text recognition method that focuses on linguistic learning in the visual space.
2. **A dual-level reconstruction design**: Joint reconstruction at the pixel level and the feature level learns the visual features and the implicit linguistic information of characters simultaneously, further improving representation quality.
3. **Experimental verification**: Across multiple text recognition benchmarks, SSM achieves state-of-the-art performance, with an average gain of 4.1% and a new best average word accuracy of 86.6% on the Union14M benchmarks. In multilingual text recognition, SSM also shows clear advantages, outperforming other self-supervised methods by 15.5% and 1.5% in the individual training and joint training settings respectively.

### Method overview:

- **Symmetric superimposed input construction**: Generate symmetrically superimposed inputs by combining the original image with a horizontally flipped, vertically flipped, or 180-degree rotated view, which ensures a large character overlap area.
- **Pixel-level image reconstruction**: Use an encoder-regressor-decoder architecture to reconstruct the pixels of the original and inverted images, selected by a direction index.
- **Feature-level representation reconstruction**: Reconstruct features in a high-dimensional space through a projection module and a regressor, which strengthens character-level semantic discrimination and spatial context modeling.
- **Downstream tasks**: Add a text decoder on top of the pre-trained model for final character prediction.

### Experimental results:

- **Common benchmarks**: SSM performs strongly across multiple benchmarks, notably on the IIIT and IC13 datasets, where it improves accuracy by 13.6% and 9.4% respectively over SeqCLR.
- **Union14M benchmarks**: Even against the strong Scratch-ViT-Small baseline, SSM improves performance by an average of 3.5%.
- **Multilingual text recognition**: On the MLT19 benchmark, SSM's average accuracy is 15.5% and 1.5% higher than DiG's in the individual training and joint training settings respectively. The improvement is particularly large for Arabic, which is written right to left.

Through these contributions, the paper demonstrates the effectiveness and generalization ability of SSM in scene text recognition, especially on multilingual text and in data-scarce settings.
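
The symmetric superimposed input construction from the method overview can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `invert` and `superimpose`, the equal 0.5 blending weight, and the `(H, W, C)` float image layout are all assumptions made for illustration. The source only states that the original image is added to an inverted view (horizontal flip, vertical flip, or 180-degree rotation).

```python
import numpy as np

# Hypothetical labels for the three inversion types named in the overview.
INVERSIONS = ("hflip", "vflip", "rot180")


def invert(image: np.ndarray, mode: str) -> np.ndarray:
    """Return an inverted view of an (H, W, C) image."""
    if mode == "hflip":   # mirror left-right
        return image[:, ::-1]
    if mode == "vflip":   # mirror top-bottom
        return image[::-1]
    if mode == "rot180":  # flip both axes, equivalent to a 180-degree rotation
        return image[::-1, ::-1]
    raise ValueError(f"unknown inversion mode: {mode!r}")


def superimpose(image: np.ndarray, mode: str, weight: float = 0.5):
    """Blend an image with its inverted view.

    Returns the superimposed input together with the two views that a
    reconstruction objective would use as targets. The equal weighting is
    an assumption, not a value taken from the paper.
    """
    inverted = invert(image, mode)
    mixed = weight * image + (1.0 - weight) * inverted
    return mixed, image, inverted


# Example with a random stand-in for a 32x128 RGB text image in [0, 1].
rng = np.random.default_rng(0)
text_image = rng.random((32, 128, 3))
mixed, original_target, inverted_target = superimpose(text_image, "hflip")
```

With equal weights the superimposed input is itself symmetric under the chosen inversion, so the input alone does not reveal which view is which. That ambiguity is why the reconstruction has to be conditioned on a direction index.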