Mathematical Formula Identification in PDF Documents

Xiaoyan Lin, Liangcai Gao, Zhi Tang, Xiaofan Lin, Xuan Hu
DOI: https://doi.org/10.1109/icdar.2011.285
2011-01-01
Abstract: Recognizing mathematical expressions in PDF documents is a new and important field in document analysis. It is quite different from extracting mathematical expressions from image-based documents. In this paper, we propose a novel method that combines rule-based and learning-based techniques to detect both isolated and embedded mathematical expressions in PDF documents. Moreover, various features of formulas, including geometric layout, character and context content, are used to adapt to a wide range of formula types. Experimental results show satisfactory performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale Chinese e-Book production.
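The abstract gives no implementation details, so the following is only a minimal sketch of how a rule-based filter and a learned score might be combined to flag isolated formula lines. Everything in it is an assumption made for illustration: the `Line` fields, the `MATH_CHARS` set, the feature names, the hand-set weights (standing in for a trained classifier), and the 0.5 threshold are hypothetical and are not taken from the paper. It illustrates only character and geometric layout cues, and it does not cover embedded formulas.

```python
import re
from dataclasses import dataclass

# Hypothetical set of characters treated as mathematical (assumption).
MATH_CHARS = set("=+×÷∑∫√≤≥≠∞^_<>")


@dataclass
class Line:
    """A text line extracted from a PDF page (hypothetical structure)."""
    text: str
    left_indent: float  # gap to the left column edge, as a fraction of column width
    right_gap: float    # gap to the right column edge, as a fraction of column width


def extract_features(line: Line) -> dict:
    """Compute simple character and geometric layout features."""
    chars = [c for c in line.text if not c.isspace()]
    math_ratio = sum(c in MATH_CHARS for c in chars) / max(len(chars), 1)
    # Layout cue: the line is indented and roughly centered in the column.
    centered = line.left_indent > 0.1 and abs(line.left_indent - line.right_gap) < 0.05
    # Layout cue: a trailing equation number such as "(1)" or "(2.3)".
    numbered = bool(re.search(r"\(\d+(\.\d+)?\)\s*$", line.text))
    return {"math_ratio": math_ratio, "centered": centered, "numbered": numbered}


def passes_rules(f: dict) -> bool:
    """Rule stage: discard lines that show no formula cue at all."""
    return f["math_ratio"] > 0 or f["centered"] or f["numbered"]


def learned_score(f: dict) -> float:
    """Stand-in for a trained classifier: a fixed linear score in [0, 1]."""
    return 0.4 * min(1.0, 2.0 * f["math_ratio"]) + 0.3 * f["centered"] + 0.3 * f["numbered"]


def is_isolated_formula(line: Line, threshold: float = 0.5) -> bool:
    """Apply the rule filter first, then threshold the score."""
    f = extract_features(line)
    return passes_rules(f) and learned_score(f) >= threshold


if __name__ == "__main__":
    formula = Line("E = mc^2   (1)", left_indent=0.3, right_gap=0.3)
    prose = Line("This paragraph discusses the layout of pages.", left_indent=0.0, right_gap=0.02)
    print(is_isolated_formula(formula), is_isolated_formula(prose))
```

In a real system the fixed weights would be replaced by a model trained on labeled lines, and the inputs would come from the glyph positions and font information that a PDF parser exposes.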