Abstract: An optical character recognition (OCR) system segments characters from a given document before recognizing them. Recognizing such character images requires a class label for each character sample in the training set, which in practice means sorting the samples of each segmented character into separate folders. This sorting has to be done manually and is therefore time-consuming. Ancient documents also suffer from humidity spots, ink stains, and faded portions of text, which make character recognition even more challenging. The present article proposes a novel semi-self-supervised learning-based OCR method to recognize each character segmented from ancient documents handwritten in Devanagari and Maithili scripts. The proposed method has two modules: a feature extraction module and a recognition module. The feature extraction module extracts deep hierarchical features from each pre-segmented character image using a generative self-supervised learning approach. The recognition module focuses on important features using an attention mechanism and learns long temporal sequences using the Gated Recurrent Unit (GRU) variant of a recurrent neural network classifier to assign each segmented character to its proper class. The feature extraction module is trained on 60% of the dataset (unlabelled), whereas the recognition module is trained on 5% of the dataset (manually labelled). The performance of the proposed OCR method is evaluated on two self-generated datasets of ancient handwritten documents in Devanagari and Maithili scripts. The experimental results demonstrate that the proposed OCR method outperforms the state-of-the-art (SOTA) methods on these datasets, improving character recognition accuracy over the SOTA methods by 2.27% and 3.48% for the Devanagari and Maithili scripts, respectively.
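
The data budget described in the abstract (60% unlabelled samples for the feature extraction module, 5% manually labelled samples for the recognition module) can be illustrated with a minimal sketch. The function name `split_indices`, the fixed seed, and the use of a random shuffle are assumptions made for illustration only; the abstract does not say how the authors drew the subsets or how the remaining 35% of the data was used.

```python
import random


def split_indices(n_samples, unlabelled_frac=0.60, labelled_frac=0.05, seed=0):
    """Partition sample indices into three disjoint pools.

    Hypothetical helper, not the authors' code:
    - unlabelled: pool for self-supervised feature-extractor training
    - labelled:   small pool that would be labelled by hand for the recognizer
    - remainder:  everything else (its use is not stated in the abstract)
    """
    if unlabelled_frac < 0 or labelled_frac < 0:
        raise ValueError("fractions must be non-negative")
    if unlabelled_frac + labelled_frac > 1.0:
        raise ValueError("fractions must sum to at most 1.0")

    # Shuffle with a private, seeded generator so the split is reproducible.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # round() avoids off-by-one errors from floating-point products.
    n_unlabelled = round(n_samples * unlabelled_frac)
    n_labelled = round(n_samples * labelled_frac)

    unlabelled = indices[:n_unlabelled]
    labelled = indices[n_unlabelled:n_unlabelled + n_labelled]
    remainder = indices[n_unlabelled + n_labelled:]
    return unlabelled, labelled, remainder


if __name__ == "__main__":
    unl, lab, rest = split_indices(1000)
    print(len(unl), len(lab), len(rest))
```

With 1,000 segmented character images, this split would leave only 50 images to be labelled by hand, which is the practical point of the semi-self-supervised setup: most of the training signal comes from images that never need to be sorted into class folders.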

Word level Script Identification from Bangla and Devanagri Handwritten Texts mixed with Roman Script

Handwritten Script Identification from Text Lines

A New Approach for Texture based Script Identification At Block Level using Quad Tree Decomposition

Handwritten Bangla Basic and Compound character recognition using MLP and SVM classifier

Cross-language Framework for Word Recognition and Spotting of Indic Scripts

Script identification in handwritten and printed documents using convolutional recurrent connection

Optical Script Identification for multi-lingual Indic-script

Segmentation of Offline Handwritten Bengali Script

Classification of Bangla Compound Characters Using a HOG-CNN Hybrid Model

End-to-End Optical Character Recognition for Bengali Handwritten Words

CNN-Bidirectional LSTM Based Optical Character Recognition of Sanskrit Manuscripts : A Comprehensive Systematic Literature Review

MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification

A semi-self-supervised learning model to recognize handwritten characters in ancient documents in Indian scripts

Word Searching in Scene Image and Video Frame in Multi-Script Scenario using Dynamic Shape Coding

Bangla-Meitei Mayek scripts handwritten character recognition using Convolutional Neural Network

Optical Text Recognition in Nepali and Bengali: A Transformer-based Approach

Script-Agnostic Language Identification

Recognition of Handwritten Roman Script Using Tesseract Open source OCR Engine

BN-DRISHTI: Bangla Document Recognition through Instance-level Segmentation of Handwritten Text Images

Handwritten OCR for Indic Scripts: A Comprehensive Overview of Machine Learning and Deep Learning Techniques

Residual attention-based multi-scale script identification in scene text images