Abstract: Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially artistic and severely distorted characters. The limitation lies in the insufficient exploration of character morphologies, including the monotony of widely used synthetic training data and the sensitivity of the model to character morphologies. To address these issues, inspired by the human learning process of viewing and summarizing, we facilitate the contrastive learning-based STR framework in a self-motivated manner by leveraging synthetic and real unlabeled data without any human cost. In the viewing process, to compensate for the simplicity of synthetic data and enrich character morphology diversity, we propose an Online Generation Strategy to generate background-free samples with diverse character styles. By excluding background noise distractions, the model is encouraged to focus on character morphology and to generalize to complex samples even when trained with only simple synthetic data. To boost the summarizing process, we theoretically demonstrate a derivation error in the previous character contrastive loss, which mistakenly causes sparsity in the intra-class distribution and exacerbates ambiguity on challenging samples. Therefore, a new Character Unidirectional Alignment Loss is proposed to correct this error and unify the representation of the same characters in all samples by aligning the character features in the student model with the reference features in the teacher model. Extensive experimental results show that our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark, respectively). Code will be available at https://github.com/qqqyd/ViSu.
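
As a rough illustration of the alignment idea described in the abstract, the sketch below computes a one-sided loss that pulls each student character feature toward a fixed per-class reference feature from the teacher. It is not the paper's actual Character Unidirectional Alignment Loss: the cosine-distance form, the function name `unidirectional_alignment_loss`, and the `teacher_refs` dictionary are all assumptions made for illustration. In a real training framework the teacher references would presumably be treated as constants (no gradient), which is what would make the alignment unidirectional.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, 1e-12)


def unidirectional_alignment_loss(student_feats, labels, teacher_refs):
    """Mean cosine distance from student features to teacher references.

    student_feats: list of feature vectors, one per character (hypothetical).
    labels: character class of each student feature.
    teacher_refs: dict mapping class -> reference vector from the teacher,
        treated as a fixed target (assumption: only the student is updated).
    """
    if not student_feats:
        return 0.0
    total = 0.0
    for feat, label in zip(student_feats, labels):
        total += 1.0 - cosine_similarity(feat, teacher_refs[label])
    return total / len(student_feats)


# Toy 2-D example with two character classes.
teacher_refs = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
aligned = unidirectional_alignment_loss(
    [[2.0, 0.0], [0.0, 3.0]], ["a", "b"], teacher_refs
)
misaligned = unidirectional_alignment_loss([[0.0, 1.0]], ["a"], teacher_refs)
```

In this toy setup, student features that point in the same direction as their class reference incur a loss near zero, while a feature orthogonal to its reference incurs a loss near one.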

Sequential Style Consistency Learning for Domain-Generalizable Text Recognition

IS2Net: Intra-domain Semantic and Inter-domain Style Enhancement for Semi-supervised Medical Domain Generalization

Towards Self-Similarity Consistency and Feature Discrimination for Unsupervised Domain Adaptation

Sequence-To-Sequence Domain Adaptation Network For Robust Text Image Recognition

Chasing Consistency in Text-to-3D Generation from a Single Image

Exploring Style-Robust Scene Text Detection via Style-Aware Learning

Sequential visual and semantic consistency for semi-supervised text recognition

Style-Content Metric Learning for Multidomain Remote Sensing Object Recognition

Discriminative Style Learning for Cross-Domain Image Captioning

StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements

Domain generalization person re-identification via style adaptation learning

MSSRNet: Manipulating Sequential Style Representation for Unsupervised Text Style Transfer

Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing

SC2: Towards Enhancing Content Preservation and Style Consistency in Long Text Style Transfer

Robust Text Image Recognition via Adversarial Sequence-to-Sequence Domain Adaptation

Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

Multimodal Visual-Semantic Representations Learning for Scene Text Recognition

A Feasible Framework for Arbitrary-Shaped Scene Text Recognition

Text Recognition in Real Scenarios with a Few Labeled Samples

Rethink arbitrary style transfer with transformer and contrastive learning

Synthesizing Data for Text Recognition with Style Transfer