Identification of Embedded Mathematical Formulas in PDF Documents Using SVM

Xiaoyan Lin, Liangcai Gao, Zhi Tang, Xuan Hu, Xiaofan Lin
DOI: https://doi.org/10.1117/12.912445
2012-01-01
Abstract: With the tremendous popularity of the PDF format, recognizing mathematical formulas in PDF documents has become a new and important problem in the document analysis field. In this paper, we present a method for identifying embedded mathematical formulas in PDF documents, based on a Support Vector Machine (SVM). The method first segments text lines into words and then classifies each word into one of two classes, formula or ordinary text. Various features of embedded formulas, including geometric layout, character content, and context content, are used to build a robust and adaptable SVM classifier. Embedded formulas are then extracted by merging the words labeled as formulas. Experimental results show good performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale e-Book production.
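The segment, classify, and merge steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: whitespace splitting stands in for the paper's word segmentation, and the hypothetical `is_formula_word` heuristic stands in for the trained SVM, which the abstract says relies on geometric layout, character, and context features. All function and variable names here are assumptions made for illustration.

```python
from typing import Callable, List

# Assumption: a tiny symbol set used only by the placeholder classifier below.
MATH_CHARS = set("=+<>^_")


def is_formula_word(word: str) -> bool:
    """Hypothetical stand-in for the SVM: flag words containing a math symbol."""
    return any(ch in MATH_CHARS for ch in word)


def extract_embedded_formulas(
    line: str,
    classify: Callable[[str], bool] = is_formula_word,
) -> List[str]:
    """Split a text line into words, label each word, merge adjacent formula words."""
    words = line.split()  # simplification: the paper segments PDF text lines
    formulas: List[str] = []
    current: List[str] = []  # run of consecutive words labeled as formula
    for word in words:
        if classify(word):
            current.append(word)
        elif current:
            # An ordinary-text word closes the current formula run.
            formulas.append(" ".join(current))
            current = []
    if current:
        formulas.append(" ".join(current))
    return formulas


if __name__ == "__main__":
    print(extract_embedded_formulas("where x+y = z^2 holds and a<b"))
```

Passing the classifier as a parameter keeps the merging logic independent of how words are labeled, so a trained model's prediction function could be substituted for the placeholder without changing the merge step.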