Abstract:Optical character recognition (OCR) is the process of acquiring text and layout information through analysis and recognition of text data image files. It is also a process to identify the geometric location and orientation of the texts and their symmetrical behavior. It usually consists of two steps: text detection and text recognition. Scene text recognition is a subfield of OCR that focuses on processing text in natural scenes, such as streets, billboards, license plates, etc. Unlike traditional document category photographs, it is a challenging task to use computer technology to locate and read text information in natural scenes. Imaging sequence recognition is a longstanding subject of research in the field of computer vision. Great progress has been made in this field; however, most models struggled to recognize text in images of complex scenes with high accuracy. This paper proposes a new pattern of text recognition based on the convolutional recurrent neural network (CRNN) as a solution to address this issue. It combines real-time scene text detection with differentiable binarization (DBNet) for text detection and segmentation, text direction classifier, and the Retinex algorithm for image enhancement. To evaluate the effectiveness of the proposed method, we performed experimental analysis of the proposed algorithm, and carried out simulation on complex scene image data based on existing literature data and also on several real datasets designed for a variety of nonstationary environments. Experimental results demonstrated that our proposed model performed better than the baseline methods on three benchmark datasets and achieved on-par performance with other approaches on existing datasets. This model can solve the problem that CRNN cannot identify text in complex and multi-oriented text scenes. Furthermore, it outperforms the original CRNN model with higher accuracy across a wider variety of application scenarios.

Visual and semantic guided scene text retrieval

Visual Matching is Enough for Scene Text Retrieval.

Scene Text Detection and Recognition System for Visually Impaired People in Real World

Scene-text aware cross-modal retrieval based on semantic matching (ChinaMM2024)

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

Image Retrieval for Visual Localization via Scene Text Detection and Logo Filtering

Beyond visual semantics: Exploring the role of scene text in image understanding

Unambiguous Scene Text Segmentation with Referring Expression Comprehension

Towards Accurate Scene Text Recognition with Semantic Reasoning Networks

Text-based Person Search in Full Images via Semantic-Driven Proposal Generation

Decoupling Visual-Semantic Feature Learning for Robust Scene Text Recognition

SEE: Towards Semi-Supervised End-to-End Scene Text Recognition

Efficient scene text image super-resolution with semantic guidance

Multimodal learning with only image data: A deep unsupervised model for street view image retrieval by fusing visual and scene text features of images

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

StacMR: Scene-Text Aware Cross-Modal Retrieval

Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval

Scene Graph Based Fusion Network For Image-Text Retrieval

A Convolutional Recurrent Neural-Network-Based Machine Learning for Scene Text Recognition Application

A Text-Context-Aware CNN Network for Multi-oriented and Multi-language Scene Text Detection.

ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval