Research on Mathematical Formula Identification in Digital Chinese Documents

Xiaoyan LIN, Liangcai GAO, Zhi TANG
DOI: https://doi.org/10.13209/j.0479-8023.2014.009
2014-01-01
Abstract: Unlike traditional formula identification methods, which target scanned images and Latin-script documents, the proposed method takes the characteristics of digital Chinese documents into account and identifies both isolated and embedded formulae by combining machine learning techniques with heuristic rules. Text line detection strategies and word segmentation rules are proposed for Chinese documents; effective features and machine learning algorithms for formula identification in Chinese documents are selected; and post-processing techniques, including text line or word merging, are proposed to overcome over-segmentation. The experimental results show that the proposed method achieves satisfactory results in identifying formulae in digital Chinese documents. Furthermore, a public Chinese document dataset is constructed to facilitate fair comparison between different formula identification methods. © 2014 Peking University.