MathWriting: A Dataset For Handwritten Mathematical Expression Recognition

Philippe Gervais, Asya Fadeeva, Andrii Maksai
2024-04-17
Abstract: We introduce MathWriting, the largest online handwritten mathematical expression dataset to date. It consists of 230k human-written samples and an additional 400k synthetic ones. MathWriting can also be used for offline HME recognition and is larger than all existing offline HME datasets like IM2LATEX-100K. We introduce a benchmark based on MathWriting data in order to advance research on both online and offline HME recognition.
Computer Vision and Pattern Recognition, Human-Computer Interaction, Machine Learning
What problem does this paper attempt to address?
The paper addresses Handwritten Mathematical Expression (HME) recognition. It makes two contributions:

1. **The MathWriting dataset**: The largest online handwritten mathematical expression dataset to date, containing 230,000 human-written samples and an additional 400,000 synthetic samples. It can be used for online HME recognition directly, and for offline HME recognition by rasterizing the handwriting into images.
2. **A benchmark**: A new benchmark based on the MathWriting dataset, intended to advance research in both online and offline HME recognition. It includes a test set and uses Character Error Rate (CER) as the evaluation metric.

By releasing a dataset of this scale, the authors hope to ease the shortage of handwritten mathematical expression data available for research and to support model training and performance improvements in the field.
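
To make the evaluation metric concrete, here is a minimal sketch of Character Error Rate in its usual form: the edit (Levenshtein) distance between the predicted and reference strings, divided by the reference length. The function names `edit_distance` and `character_error_rate` are hypothetical, and the summary does not say how the benchmark normalizes or tokenizes the LaTeX before scoring, so this should not be read as the authors' evaluation code.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty prefix of hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if len(ref) == 0:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(character_error_rate("abcd", "abxd"))
```

Because both functions accept any sequence, the same code can score lists of LaTeX tokens instead of raw characters, should a token-level variant be preferred.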