Abstract: Despite recent remarkable improvements in scene text recognition (STR), most studies have focused on English, which has only a small number of characters. However, STR models show a large performance degradation on languages with a large number of characters (e.g., Chinese and Korean), especially on characters that rarely appear because of the long-tailed distribution of characters in such languages. To address this issue, we conducted an empirical analysis using synthetic datasets with different character-level distributions (e.g., balanced and long-tailed distributions). While adding a substantial number of tail-class characters without considering context helps the model recognize individual characters correctly, training with such a synthetic dataset hinders the model from learning contextual information (i.e., relations among characters), which is also important for predicting the whole word. Based on this observation, we propose a novel Context-Aware and Free Experts Network (CAFE-Net) using two experts: 1) a context-aware expert that learns contextual representations from a long-tailed dataset composed of common words used in everyday life, and 2) a context-free expert that focuses on correctly predicting individual characters by using a dataset with a balanced number of characters. With the two experts trained to focus on contextual and visual representations, respectively, we propose a novel confidence ensemble method to compensate for the limitations of each expert. Through experiments, we demonstrate that CAFE-Net improves STR performance on languages with a large number of characters. Moreover, we show that CAFE-Net is easily applicable to various STR models.
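
The idea of combining two experts by confidence can be illustrated with a minimal sketch. The abstract does not specify how the confidence ensemble is computed, so the rule below is an assumption made for illustration only: at each character position, the prediction of the expert with the higher maximum softmax probability is kept. The function names `softmax` and `confidence_ensemble`, the per-position selection rule, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_ensemble(
    aware_logits: Sequence[Sequence[float]],
    free_logits: Sequence[Sequence[float]],
    charset: str,
) -> str:
    """Assumed rule (not from the paper): per character position, keep the
    prediction of whichever expert has the higher maximum probability."""
    chars = []
    for aware_row, free_row in zip(aware_logits, free_logits):
        aware_probs = softmax(aware_row)
        free_probs = softmax(free_row)
        # Ties go to the context-aware expert (an arbitrary choice).
        best = aware_probs if max(aware_probs) >= max(free_probs) else free_probs
        chars.append(charset[best.index(max(best))])
    return "".join(chars)


# Toy example: three-character alphabet, two character positions.
charset = "abc"
aware = [[5.0, 0.0, 0.0], [0.0, 0.1, 0.0]]  # confident at position 0 only
free = [[0.0, 1.0, 0.0], [0.0, 0.0, 4.0]]   # confident at position 1 only
result = confidence_ensemble(aware, free, charset)
```

In this toy input, the context-aware expert is more confident at the first position and the context-free expert at the second, so each position takes its character from a different expert. A word-level variant, which picks one expert for the whole word based on an aggregated confidence score, would be an equally plausible reading of the abstract.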

Towards Accurate Alignment and Sufficient Context in Scene Text Recognition

Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition

A Feasible Framework for Arbitrary-Shaped Scene Text Recognition

Cascade 2D attentional decoders with context-enhanced encoder for scene text recognition

Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition

Context Perception Parallel Decoder for Scene Text Recognition

OTE: Exploring Accurate Scene Text Recognition Using One Token

Hierarchical Refined Attention for Scene Text Recognition

Improving Scene Text Recognition for Character-Level Long-Tailed Distribution

Scene Text Recognition with Cascade Attention Network

ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition

MASTER: Multi-Aspect Non-local Network for Scene Text Recognition

Context-Based Contrastive Learning for Scene Text Recognition

Masked and Permuted Implicit Context Learning for Scene Text Recognition

Focus-Enhanced Scene Text Recognition with Deformable Convolutions

Character Region Awareness Network for Scene Text Recognition

Sequential Deformation for Accurate Scene Text Detection

LATextSpotter: Empowering Transformer Decoder with Length Perception Ability

VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling

TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance