Abstract: Scene text in natural images carries additional semantic information that can aid image classification. Existing methods extract scene text based on simple rules or dictionaries. They do not fully account for the deeper meaning of the text or for the relationship between visual and textual content, which makes it difficult to judge the semantic accuracy of the text and its relevance to the image, so they perform poorly on image classification tasks. To address these problems, this paper proposes an image classification method based on Cross modal Knowledge Learning of Scene Text (CKLST). CKLST consists of three stages: cross‐modal scene text recognition, text semantic enhancement, and visual‐text feature alignment. In the first stage, multi‐attention is used to extract features layer by layer, and a self‐mask‐based iterative correction strategy is used to improve scene text recognition accuracy. In the second stage, knowledge features are extracted using external knowledge and fused with text features to enhance the text's semantic information. In the third stage, CKLST aligns visual and text features across attention mechanisms using a similarity matrix, so that the correlation between images and text can be captured to improve image classification accuracy. On the Con‐Text, Crowd Activity, Drink Bottle, and Synth Text datasets, CKLST performs significantly better than other baselines on fine‐grained image classification, with mAP improvements of 3.54%, 5.37%, 3.28%, and 2.81% over the best baseline, respectively.
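The third stage (visual‐text feature alignment with a similarity matrix) can be illustrated with a minimal sketch. This is not the CKLST implementation: the abstract does not specify the similarity function, the normalization, or the feature shapes, so the cosine similarity, the softmax with a `temperature` parameter, and the function names below are all assumptions chosen for illustration.

```python
import math

def cosine_similarity_matrix(visual, text):
    """Return S where S[i][j] is the cosine similarity of visual[i] and text[j].

    Assumption: cosine similarity stands in for whatever similarity
    measure the paper actually uses.
    """
    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    return [
        [sum(a * b for a, b in zip(v, t)) / (norm(v) * norm(t)) for t in text]
        for v in visual
    ]

def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of similarity scores."""
    m = max(row)
    exps = [math.exp((x - m) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def align_text_to_visual(visual, text, temperature=0.1):
    """For each visual feature, build a text feature weighted by similarity.

    Hypothetical helper: returns the similarity matrix, the attention
    weights (each row sums to 1), and the aligned text features
    (one vector per visual feature).
    """
    sim = cosine_similarity_matrix(visual, text)
    weights = [softmax(row, temperature) for row in sim]
    dim = len(text[0])
    aligned = [
        [sum(w * t[k] for w, t in zip(w_row, text)) for k in range(dim)]
        for w_row in weights
    ]
    return sim, weights, aligned

# Toy features: two visual regions and three recognized words, 2-D each.
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim, weights, aligned = align_text_to_visual(visual_feats, text_feats)
```

In this toy setup each visual region attends most strongly to the word whose feature points in the same direction. A real system would operate on learned, high‐dimensional features and would typically feed the aligned features into a classifier.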

Self-Supervised Cross-Language Scene Text Editing

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Scene Text Transfer for Cross-Language

Scene Style Text Editing

TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles

Exploring Stroke-Level Modifications for Scene Text Editing

Weakly supervised scene text generation for low-resource languages

Editing Text in the Wild

RewriteNet: Reliable Scene Text Editing with Implicit Decomposition of Text Contents and Styles

Explicitly-Decoupled Text Transfer With Minimized Background Reconstruction for Scene Text Editing

Language Anisotropic Cross-Lingual Model Editing

Improving Diffusion Models for Scene Text Editing with Dual Encoders

Synthetically Supervised Feature Learning For Scene Text Recognition

Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions

Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach

Cross‐modal knowledge learning with scene text for fine‐grained image classification

Progressive Scene Text Erasing with Self-Supervision

TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control

CLII: Visual-Text Inpainting via Cross-Modal Predictive Interaction

Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing

SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization