Abstract: Scene text in natural images carries additional semantic information that can aid image classification. Existing methods extract scene text based on simple rules or dictionaries. They do not fully account for the deeper meaning of the text or for the relationship between the text and the visual content, which makes it difficult to judge the semantic accuracy of the text and its relevance to the image, so they perform poorly on image classification tasks. To address these problems, this paper proposes an image classification method based on Cross‐modal Knowledge Learning of Scene Text (CKLST). CKLST consists of three stages: cross‐modal scene text recognition, text semantic enhancement, and visual‐text feature alignment. In the first stage, multi‐attention is used to extract features layer by layer, and a self‐mask‐based iterative correction strategy is used to improve scene text recognition accuracy. In the second stage, knowledge features are extracted from external knowledge and fused with text features to enhance the semantic information of the text. In the third stage, CKLST aligns visual and text features through a cross‐attention mechanism with a similarity matrix, capturing the correlation between images and text to improve image classification accuracy. On the Con‐Text, Crowd Activity, Drink Bottle, and Synth Text datasets, CKLST performs significantly better than other baselines on fine‐grained image classification, with mAP improvements of 3.54%, 5.37%, 3.28%, and 2.81%, respectively, over the best baseline.
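
The third stage described in the abstract (visual‐text alignment through attention with a similarity matrix) can be illustrated with a minimal sketch. This is not the CKLST implementation: the abstract gives no formulas, so the cosine similarity, the softmax temperature, and all function names below (`similarity_matrix`, `align_text_to_regions`) are assumptions chosen only to show the general idea of weighting text tokens by their similarity to each image region.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def similarity_matrix(visual, text):
    """Cosine similarity between every image region and every text token.

    visual: list of n region feature vectors, each of length d
    text:   list of m token feature vectors, each of length d
    Returns an n x m matrix as a list of lists.
    """
    v = [l2_normalize(row) for row in visual]
    t = [l2_normalize(row) for row in text]
    return [[sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]


def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of similarities."""
    top = max(row)
    exps = [math.exp((x - top) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def align_text_to_regions(visual, text, temperature=0.1):
    """For each region, build a text feature weighted by similarity.

    The temperature value is an assumed hyperparameter, not taken
    from the paper.
    """
    sim = similarity_matrix(visual, text)
    dim = len(text[0])
    aligned = []
    for row in sim:
        weights = softmax(row, temperature)
        aligned.append(
            [sum(w * tok[k] for w, tok in zip(weights, text)) for k in range(dim)]
        )
    return sim, aligned
```

Under these assumptions, each row of the similarity matrix scores how strongly one image region relates to each recognized text token, and the aligned features could then be concatenated with the visual features before classification. How CKLST actually fuses the two modalities is not stated in the abstract.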

Knowledge Mining with Scene Text for Fine-Grained Recognition

Cross‐modal knowledge learning with scene text for fine‐grained image classification

Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification

Automatic Scene Recognition Based on Constructed Knowledge Space Learning

Towards Accurate Scene Text Recognition with Semantic Reasoning Networks

Knowledge-Based Scene Text Recognition for Industrial Applications

Robust Scene Parsing by Mining Supportive Knowledge From Dataset

Knowledge Distillation Via Entropy Map for Scene Text Detection

Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

Mining Discriminative Visual Features Based on Semantic Relations

A New Parallel Detection-Recognition Approach for End-to-End Scene Text Extraction

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

Knowledge Guided Disambiguation for Large-Scale Scene Classification with Multi-Resolution CNNs

A Semantic-Driven Image Scene Fine-Grained Enhancement Recognition

Class-Aware Mask-guided feature refinement for scene text recognition

Scene-text aware cross-modal retrieval based on semantic matching (ChinaMM2024)

Knowledge Aware Semantic Concept Expansion for Image-Text Matching

Semantic-aware scene recognition

Improved Fusion of Visual and Semantic Representations by Gated Co-Attention for Scene Text Recognition

Mining Contextual Information Beyond Image for Semantic Segmentation

Spatial-aware Collaborative Region Mining for Fine-Grained Recognition