Integrating Canonical Neural Units and Multi-Scale Training for Handwritten Text Recognition

Zi-Rui Wang
2024-10-24
Abstract: The segmentation-free research efforts for addressing handwritten text recognition can be divided into three categories: connectionist temporal classification (CTC), hidden Markov model and encoder-decoder methods. In this paper, inspired by the above three modeling methods, we propose a new recognition network by using a novel three-dimensional (3D) attention module and global-local context information. Based on the feature maps of the last convolutional layer, a series of 3D blocks with different resolutions are split. Then, these 3D blocks are fed into the 3D attention module to generate sequential visual features. Finally, by integrating the visual features and the corresponding global-local context features, a well-designed representation can be obtained. Main canonical neural units including attention mechanisms, fully-connected layer, recurrent unit and convolutional layer are efficiently organized into a network and can be jointly trained by the CTC loss and the cross-entropy loss. Experiments on the latest Chinese handwritten text datasets (the SCUT-HCCDoc and the SCUT-EPT) and one English handwritten text dataset (the IAM) show that the proposed method can make a new milestone.
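The block-splitting step mentioned in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes the feature map is indexed as `[channel][row][column]`, that blocks tile the width axis without overlap, and that "different resolutions" means different block widths. The function names and the widths `(1, 2, 4)` are hypothetical choices made for the example.

```python
from typing import Dict, List, Sequence

# A feature map indexed as [channel][row][column] (an assumed layout).
FeatureMap = List[List[List[float]]]


def split_into_blocks(fmap: FeatureMap, block_width: int) -> List[FeatureMap]:
    """Cut a C x H x W map into non-overlapping C x H x block_width blocks.

    Assumption: blocks run left to right along the width axis, and a trailing
    remainder narrower than block_width is dropped.
    """
    if block_width <= 0:
        raise ValueError("block_width must be positive")
    width = len(fmap[0][0])
    blocks = []
    for start in range(0, width - block_width + 1, block_width):
        # Slice the same column window out of every row of every channel.
        block = [[row[start:start + block_width] for row in channel]
                 for channel in fmap]
        blocks.append(block)
    return blocks


def multi_scale_blocks(fmap: FeatureMap,
                       widths: Sequence[int] = (1, 2, 4)) -> Dict[int, List[FeatureMap]]:
    """Return one sequence of 3D blocks per (hypothetical) block width."""
    return {w: split_into_blocks(fmap, w) for w in widths}


# Toy feature map: 2 channels, 3 rows, 8 columns; each value encodes its index.
toy = [[[float(c * 100 + h * 10 + w) for w in range(8)]
        for h in range(3)]
       for c in range(2)]

scales = multi_scale_blocks(toy)
for w, blocks in scales.items():
    print(f"block width {w}: {len(blocks)} blocks")
```

Each entry of `scales` is a left-to-right sequence of 3D blocks, which is the kind of input an attention module could reduce to one visual feature vector per block.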
Artificial Intelligence
### What problems does this paper attempt to solve?

This paper aims to solve several key problems in handwritten text recognition (HTR):

1. **Segmentation-free handwritten text recognition**
   - Handwritten text recognition is a typical sequence-to-sequence problem and can be formulated as a Bayesian decision problem. Traditional methods usually rely on character detection boxes or additional training data, whereas segmentation-free methods need only text-level labels during training.
   - The paper focuses on three typical segmentation-free methods: the hidden Markov model (HMM), connectionist temporal classification (CTC), and the encoder-decoder (ED) framework.
2. **Limitations of existing methods**
   - An HMM can represent characters at high resolution, but modeling state posterior probabilities requires too many network output nodes, which makes end-to-end training difficult. Extending a 1D HMM to a 2D HMM is also computationally expensive.
   - Early CTC and ED methods either rely on the local receptive fields of convolutional layers or stack pooling layers until the feature-map height is reduced to 1 pixel, which may lose information.
3. **Proposed solution**
   - To overcome these problems, the paper proposes a new recognition network built on a novel 3D attention module and global-local context information. The network explicitly extracts two-dimensional information from feature blocks at different resolutions.
   - With a multi-scale training strategy that combines the CTC loss and the cross-entropy loss, the proposed method achieves results comparable to existing state-of-the-art methods on multiple datasets.

### Main contributions

1. **Improved recognition network**: Inspired by the typical segmentation-free methods, the paper improves the text recognition network by introducing the 3D attention module and global-local context information.
2. **Efficient organization of neural units**: The main canonical neural units (attention mechanisms, fully-connected layers, recurrent units, and convolutional layers) are carefully organized into an efficient network structure.
3. **Multi-scale training strategy**: The paper proposes a multi-scale training method that extracts 3D blocks at different resolutions and trains jointly with the CTC loss and the cross-entropy loss.
4. **Experimental verification**: Experiments were carried out on the latest Chinese handwritten text datasets (SCUT-HCCDoc and SCUT-EPT) and one English handwritten text dataset (IAM). The results show that the proposed method is comparable to state-of-the-art methods, and a comprehensive analysis verifies the effects of the 3D attention module and of the different features.

In summary, the paper aims to improve handwritten text recognition. Building on segmentation-free methods, it introduces a new network structure and training strategy to address the information loss and the difficulty of end-to-end training found in existing methods.
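The joint training objective mentioned above can be sketched in plain Python. The CTC part is the standard forward algorithm. Everything else is an assumption made for illustration: the summary does not say which branch receives the cross-entropy loss or how the two terms are weighted, so the per-step `dec_probs` input and the weight `lam` are hypothetical.

```python
import math
from typing import Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_loss(probs: Sequence[Sequence[float]], target: Sequence[int]) -> float:
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    probs[t][k] is the already-normalised probability of class k at frame t.
    Plain probabilities are fine for a toy example; practical implementations
    work in log space to avoid underflow.
    """
    # Extended label sequence: blank, l1, blank, l2, ..., blank.
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]

    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][BLANK]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        prev = alpha
        alpha = [0.0] * len(ext)
        for s, label in enumerate(ext):
            total = prev[s]                      # stay on the same symbol
            if s >= 1:
                total += prev[s - 1]             # advance by one position
            if s >= 2 and label != BLANK and label != ext[s - 2]:
                total += prev[s - 2]             # skip the blank in between
            alpha[s] = total * probs[t][label]

    # Valid alignments end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
    return math.inf if likelihood == 0.0 else -math.log(likelihood)


def cross_entropy(probs: Sequence[Sequence[float]], target: Sequence[int]) -> float:
    """Mean negative log-probability of the target class at each step."""
    return -sum(math.log(p[k]) for p, k in zip(probs, target)) / len(target)


def joint_loss(ctc_probs, dec_probs, target, lam: float = 0.5) -> float:
    """Hypothetical weighted sum of the two losses; `lam` is an assumption."""
    return lam * ctc_loss(ctc_probs, target) + (1.0 - lam) * cross_entropy(dec_probs, target)


# Toy example: 2 frames, 2 classes (blank and one character), target "a" = [1].
ctc_probs = [[0.5, 0.5], [0.5, 0.5]]
dec_probs = [[0.2, 0.8]]          # one decoding step for the single target label
target = [1]

loss = joint_loss(ctc_probs, dec_probs, target)
print(f"CTC: {ctc_loss(ctc_probs, target):.4f}  "
      f"CE: {cross_entropy(dec_probs, target):.4f}  joint: {loss:.4f}")
```

In the toy example three alignments ("a-", "-a", "aa") map to the target, each with probability 0.25, so the CTC likelihood is 0.75. In a real network both terms would be computed on outputs that share the same convolutional features, so a single backward pass updates all units jointly.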