On methods and tools of table detection, extraction and annotation in PDF documents

Shah Khusro, Asima Latif, Irfan Ullah
DOI: https://doi.org/10.1177/0165551514551903
2014-10-03
Journal of Information Science
Abstract: Table detection, extraction and annotation have been an important research problem for years. To handle this issue, different approaches have been designed for different types of documents. Among these, PDF is a widely used format for preserving and presenting different types of documents. We investigate the state of the art in table detection, extraction and annotation in PDF documents. Because of varying table structural anatomy, the state of the art in table-related research enumerates a number of approaches that are critically and analytically investigated for identifying their strengths and limitations as well as for making recommendations for further improvement. An evaluation framework is contributed that compares different information extraction tools that may be used in table detection, extraction and annotation. We found very limited attention towards these aspects in books, especially books in PDF format. There is no searching solution that can find books having tables that are semantically related to a table in a given book.
computer science, information systems, information science & library science