A Symbol Dominance Based Formulae Recognition Approach for PDF Documents

Xiaode Zhang, Liangcai Gao, Ke Yuan, Runtao Liu, Zhuoren Jiang, Zhi Tang
DOI: https://doi.org/10.1109/ICDAR.2017.189
2017-01-01
Abstract: With more and more scientific documents becoming available in PDF format, recognizing the formulae in these documents is of great significance. In this paper, we propose a symbol dominance based formula recognition approach that recovers formula structure using the rich information extracted directly from PDF files. The hierarchical structure of a formula is represented by a relationship tree, which is built recursively based on symbol dominance, taking into account both the spatial layout of the symbols and the typesetting conventions of mathematics. In addition, we propose a special character recognition method to identify formula characters that have multiple components or variable Unicode values. Repeatable and comparable experiments were conducted on two large datasets, IM2LATEX-100K and PDFME-10K. The experimental results demonstrate that our method is more adaptive and practical for PDF documents than two other available formula recognition systems, INFTY and WYGIWYS.
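The abstract does not spell out the dominance rules, so the following is only a toy sketch of the general idea of building a relationship tree recursively from symbol bounding boxes. Everything in it is an assumption made for illustration: the `Symbol` and `Node` classes, the `relation`, `build_tree`, and `to_latex` helpers, the three relation labels, and the simple centre-versus-box threshold are hypothetical and are not the paper's method. Coordinates follow the PDF convention that y grows upward.

```python
from dataclasses import dataclass, field


@dataclass
class Symbol:
    text: str
    x0: float  # left
    y0: float  # bottom (y grows upward, as in PDF user space)
    x1: float  # right
    y1: float  # top

    @property
    def cy(self) -> float:
        """Vertical centre of the bounding box."""
        return (self.y0 + self.y1) / 2


@dataclass
class Node:
    symbol: Symbol
    # Maps a relation label ("super" or "sub") to a list of child nodes.
    children: dict = field(default_factory=dict)


def relation(dom: Symbol, sym: Symbol) -> str:
    """Assumed toy rule: compare sym's centre with dom's vertical extent."""
    if sym.cy > dom.y1:
        return "super"
    if sym.cy < dom.y0:
        return "sub"
    return "inline"


def build_tree(symbols):
    """Return the baseline as a list of nodes, recursing into each script group."""
    baseline = []  # pairs of (dominant symbol, {"super": [...], "sub": [...]})
    for sym in sorted(symbols, key=lambda s: s.x0):
        rel = relation(baseline[-1][0], sym) if baseline else "inline"
        if rel == "inline":
            # sym starts a new baseline position and dominates what follows.
            baseline.append((sym, {"super": [], "sub": []}))
        else:
            # sym is dominated by the most recent baseline symbol.
            baseline[-1][1][rel].append(sym)
    return [
        Node(dom, {r: build_tree(group) for r, group in groups.items() if group})
        for dom, groups in baseline
    ]


def to_latex(nodes) -> str:
    """Serialize the tree to a LaTeX-like string."""
    parts = []
    for node in nodes:
        s = node.symbol.text
        if "sub" in node.children:
            s += "_{" + to_latex(node.children["sub"]) + "}"
        if "super" in node.children:
            s += "^{" + to_latex(node.children["super"]) + "}"
        parts.append(s)
    return "".join(parts)


if __name__ == "__main__":
    # Hand-made bounding boxes for an "x squared plus a sub i" layout.
    formula = [
        Symbol("x", 0, 0, 5, 5),
        Symbol("2", 5, 4, 8, 9),
        Symbol("+", 9, 1, 13, 5),
        Symbol("a", 14, 0, 19, 5),
        Symbol("i", 19, -3, 21, 1),
    ]
    print(to_latex(build_tree(formula)))
```

A real system would need many more relations (fractions, radicals, large operators, matrices) and would use typesetting conventions in addition to geometry, which is what the abstract says the proposed dominance measure takes into account.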