Robust Math Formula Recognition in Degraded Chinese Document Images

Ning Liu, Dongxiang Zhang, Xing Xu, Long Guo, Lijiang Chen, Wenju Liu, Dengfeng Ke
DOI: https://doi.org/10.1109/icdar.2017.27
2017-01-01
Abstract: In this paper, we study the problem of math formula recognition (MFR) in degraded Chinese document images. Compared with traditional optical character recognition (OCR), MFR poses new challenges in character segmentation and structural analysis, especially in degraded images. To tackle these issues, we propose an over-segmentation strategy, based on a convolutional neural network (CNN), to split and recognize adhesive (touching) formula elements. In addition, we propose a hierarchical framework for formula structure analysis that constructs the formula in a top-down manner, iteratively splitting regions into recognizable units. Because the community lacks degraded Chinese document images containing math formulas, we also collect a diverse ground-truth dataset of 100 images submitted by users of our system. Extensive experiments demonstrate the effectiveness and robustness of the proposed method in comparison with state-of-the-art methods.
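The abstract does not spell out how over-segmentation works, so the following is only a minimal sketch of a generic over-segment-then-merge scheme, not the authors' implementation. It assumes a binary image (1 = ink), proposes cuts wherever the vertical ink projection becomes thin, and then uses dynamic programming to keep the grouping of pieces that a recognizer scores highest. The names `candidate_cuts`, `best_segmentation`, and `toy_score` are hypothetical; `toy_score` is a stand-in for a CNN confidence and simply prefers pieces four columns wide.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[int]]  # rows of 0/1 pixels, 1 = ink


def candidate_cuts(img: Image, max_ink: int = 1) -> List[int]:
    """Propose cut columns where the vertical ink projection drops to a thin stroke."""
    width = len(img[0])
    profile = [sum(row[x] for row in img) for x in range(width)]
    cuts = [0]
    for x in range(1, width):
        # A cut starts where a thick column is followed by a thin one.
        if profile[x] <= max_ink and profile[x - 1] > max_ink:
            cuts.append(x)
    cuts.append(width)
    return cuts


def best_segmentation(
    img: Image,
    cuts: List[int],
    score: Callable[[List[Sequence[int]]], float],
    max_span: int = 2,
) -> List[Tuple[int, int]]:
    """Merge adjacent pieces (at most max_span per symbol) to maximize the total score."""
    n = len(cuts) - 1  # number of primitive pieces
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_span), j):
            crop = [row[cuts[i]:cuts[j]] for row in img]
            total = best[i] + score(crop)
            if total > best[j]:
                best[j], back[j] = total, i
    # Walk the back-pointers to recover the chosen column ranges.
    segments = []
    j = n
    while j > 0:
        i = back[j]
        segments.append((cuts[i], cuts[j]))
        j = i
    return segments[::-1]


def toy_score(crop: List[Sequence[int]]) -> float:
    """Stand-in for a CNN confidence (assumption): prefers pieces 4 columns wide."""
    return -abs(len(crop[0]) - 4)


# Two symbols joined by thin strokes; columns 1 and 5 contain a single ink pixel.
edge_row = [1, 0, 1, 1, 1, 0, 1, 1, 1]
img = [edge_row, [1] * 9, edge_row]

cuts = candidate_cuts(img)
segments = best_segmentation(img, cuts, toy_score)
print(cuts, segments)
```

In this toy image the cut proposed at column 1 is spurious, and the merge step absorbs it into the first symbol. In a real system the scorer would be a trained classifier applied to each candidate crop.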