A Mathematics Retrieval System for Formulae in Layout Presentations
Xiaoyan Lin,Liangcai Gao,Xuan Hu,Zhi Tang,Yingnan Xiao,Xiaozhong Liu
DOI: https://doi.org/10.1145/2600428.2609611
2014-01-01
Abstract:The semantics of mathematical formulae depend on their spatial structure, and they usually exist in layout presentations such as PDF, LaTeX, and Presentation MathML, which challenges previous text index and retrieval methods. This paper proposes an innovative mathematics retrieval system along with the novel algorithms, which enables efficient formula index and retrieval from both webpages and PDF documents. Unlike prior studies, which require users to manually input formula markup language as query, the new system enables users to "copy" formula queries directly from PDF documents. Furthermore, by using a novel indexing and matching model, the system is aimed at searching for similar mathematical formulae based on both textual and spatial similarities. A hierarchical generalization technique is proposed to generate sub-trees from the semi-operator tree of formulae and support substructure match and fuzzy match. Experiments based on massive Wikipedia and CiteSeer repositories show that the new system along with novel algorithms, comparing with two representative mathematics retrieval systems, provides more efficient mathematical formula index and retrieval, while simplifying user query input for PDF documents.