MNIST-MIX: A Multi-language Handwritten Digit Recognition Dataset

Weiwei Jiang
DOI: https://doi.org/10.1088/2633-1357/abad0e
2020-04-08
Abstract: In this letter, we contribute a multi-language handwritten digit recognition dataset named MNIST-MIX, which is the largest dataset of the same type in terms of both languages and data samples. With the same data format as MNIST, MNIST-MIX can be seamlessly applied in existing studies for handwritten digit recognition. By introducing digits from 10 different languages, MNIST-MIX becomes a more challenging dataset and its imbalanced classification requires a better design of models. We also present the results of applying a LeNet model which is pre-trained on MNIST as the baseline.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem that existing handwritten digit recognition datasets (such as MNIST) are too simple for modern deep-learning models and cannot fully evaluate their performance. In response, the author proposes a multilingual handwritten digit recognition dataset named MNIST-MIX. It contains handwritten digits in multiple languages and has the largest number of samples among datasets of its kind, with the aim of providing a more challenging benchmark for handwritten digit recognition. Specifically, MNIST-MIX combines handwritten digit images in 10 languages (Arabic, Bengali, Devanagari, English, Persian, Kannada, Swedish, Telugu, Tibetan and Urdu) from 13 different datasets. This increases the diversity and complexity of the data and also introduces an imbalanced classification problem: the number of samples varies greatly across languages, which calls for models designed to handle that imbalance.

The author also provides a pre-trained LeNet model as a baseline. This model is trained on the original MNIST dataset and tested on MNIST-MIX to show the challenges of the new dataset. The experimental results show that although the LeNet model achieves an accuracy of 90.22% on MNIST-MIX, its balanced accuracy is only 66.61%, indicating that there is still much room for improvement when dealing with highly imbalanced datasets.
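The gap between accuracy and balanced accuracy can be illustrated with a small sketch. It assumes the common definition of balanced accuracy as the mean of per-class recall; the summary does not state the exact formula used in the paper, and the toy labels below are hypothetical, not drawn from MNIST-MIX.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition of balanced accuracy)."""
    total = defaultdict(int)  # number of samples per true class
    hit = defaultdict(int)    # number of correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hit[t] += 1
    recalls = [hit[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced toy data: 90 samples of a majority class,
# 10 of a minority class, and a model that always predicts the majority.
y_true = ["majority"] * 90 + ["minority"] * 10
y_pred = ["majority"] * 100

acc = accuracy(y_true, y_pred)           # 90 of 100 correct
bal = balanced_accuracy(y_true, y_pred)  # recalls are 1.0 and 0.0
print(f"accuracy={acc:.2f}, balanced accuracy={bal:.2f}")
```

In this toy case the model is right on 90% of the samples yet never recognizes the minority class, so its balanced accuracy is only 50%. The same effect, in milder form, would explain how a model can score well on plain accuracy while scoring much lower on balanced accuracy for an imbalanced dataset.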