Semi-supervised Feature Learning For Improving Writer Identification

Shiming Chen, Yisong Wang, Chin-Teng Lin, Weiping Ding, Zehong Cao
DOI: https://doi.org/10.1016/j.ins.2019.01.024
2018-10-06
Abstract: Data augmentation is usually used by supervised learning approaches for offline writer identification, but such approaches require extra training data and potentially lead to overfitting errors. In this study, a semi-supervised feature learning pipeline was proposed to improve the performance of writer identification by training with extra unlabeled data and the original labeled data simultaneously. Specifically, we proposed a weighted label smoothing regularization (WLSR) method for data augmentation, which assigned the weighted uniform label distribution to the extra unlabeled data. The WLSR method could regularize the convolutional neural network (CNN) baseline to allow more discriminative features to be learned to represent the properties of different writing styles. The experimental results on well-known benchmark datasets (ICDAR2013 and CVL) showed that our proposed semi-supervised feature learning approach could significantly improve the baseline measurement and perform competitively with existing writer identification approaches. Our findings provide new insights into offline writer identification.
Machine Learning
What problem does this paper attempt to address?
The paper addresses two consequences of limited labeled data in offline writer identification: model overfitting and insufficient feature-learning ability. Specifically:

1. **Limited labeled data**: Most existing methods rely on supervised learning and need a large amount of labeled data to train the model. In practice, labeled data is expensive to obtain, and the benchmark datasets provide only a limited number of handwritten text images per writer.
2. **Overfitting**: To increase the amount of data, some studies use data augmentation, but this is prone to causing overfitting, especially on small datasets.
3. **Insufficient feature-learning ability**: Traditional supervised methods have difficulty learning highly discriminative features from a small amount of labeled data, which hurts identification performance.

To address these problems, the authors propose a semi-supervised feature-learning pipeline that trains on additional unlabeled data together with the original labeled data. With the Weighted Label Smoothing Regularization (WLSR) method, the model can use unlabeled data during training, which reduces the risk of overfitting and improves the features it learns, thereby improving writer identification performance.

### Specific methods

- **Semi-supervised learning framework**: The framework trains on both labeled and unlabeled data, aiming to learn more effective features from more data.
- **Weighted Label Smoothing Regularization (WLSR)**: For unlabeled data, WLSR assigns a weighted uniform label distribution to each sample, which regularizes the convolutional neural network (CNN) so that it learns more discriminative features.

With these methods, the authors aim to significantly improve writer identification performance without adding a large amount of labeled data, and they verify the approach on the ICDAR2013 and CVL benchmark datasets.
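The idea of assigning a weighted uniform label distribution to unlabeled samples can be illustrated with a small per-sample loss. This is a minimal sketch, not the authors' implementation: the function names, the `unlabeled_weight` parameter, and its default value are hypothetical, and the summary above does not state how the paper actually chooses the weight. Labeled samples get ordinary cross-entropy against their writer label, while unlabeled samples get cross-entropy against the uniform distribution over all writers, scaled by the weight.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of class scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def wlsr_style_loss(logits, label=None, unlabeled_weight=0.5):
    """Per-sample loss in the spirit of weighted label smoothing.

    logits: raw CNN scores, one per writer class.
    label: integer writer index for a labeled sample, or None if unlabeled.
    unlabeled_weight: hypothetical scaling factor for unlabeled samples
        (an assumption for illustration, not a value from the paper).
    """
    log_probs = log_softmax(logits)
    if label is not None:
        # Labeled sample: standard cross-entropy with a one-hot target.
        return -log_probs[label]
    # Unlabeled sample: target is uniform (1/K for each of K writers),
    # so the cross-entropy is the negative mean log-probability.
    k = len(logits)
    return -unlabeled_weight * sum(log_probs) / k


if __name__ == "__main__":
    scores = [2.0, 0.5, -1.0, 0.0]
    print("labeled loss:  ", wlsr_style_loss(scores, label=0))
    print("unlabeled loss:", wlsr_style_loss(scores, label=None))
```

Under this formulation, the unlabeled term is smallest when the network spreads its probability evenly across writers, so it discourages overconfident predictions on samples that belong to no known writer. That is one way to read the regularizing effect described above.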