Table Detection from Plain Text Using Machine Learning and Document Structure

Juanzi Li, Jie Tang, Qiang Song, Peng Xu
DOI: https://doi.org/10.1007/11610113_79
2006-01-01
Abstract: This paper addresses the problem of table extraction from plain text. Tables are one of the most common ways of presenting information, and table extraction has applications in information retrieval, knowledge acquisition, and text mining. Automatic information extraction from tables is challenging. Existing methods have focused mainly on table extraction from web pages (formatted table extraction). To the best of our knowledge, table extraction from plain text has not yet received sufficient attention. In this paper, unformatted table extraction is formalized as two subtasks: unformatted table block detection and unformatted table row identification. We concentrate particularly on table extraction from Chinese documents. We propose to perform table extraction by combining machine learning methods with document structure. We first view the task as classification and propose a statistical approach based on Naïve Bayes, for which we define the features used in the classification model. Next, we use document structure to improve detection performance. Experimental results indicate that the proposed methods significantly outperform the baseline methods for unformatted table extraction.
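The classification view described in the abstract can be illustrated with a minimal sketch: each plain-text line is mapped to a few binary features, and a Bernoulli Naïve Bayes model decides whether the line is a table row. The abstract does not list the paper's features, so the names in `line_features` (`has_wide_gap`, `many_columns`, `digit_heavy`, `short_line`), the thresholds, and the English toy training lines are all assumptions made for illustration. The document-structure step is not modeled here.

```python
import math
import re


def line_features(line):
    """Hypothetical binary features for one line (not the paper's feature set)."""
    stripped = line.strip()
    digits = sum(ch.isdigit() for ch in line)
    non_space = sum(not ch.isspace() for ch in line)
    return {
        # Two or more spaces (or a tab) between non-space characters.
        "has_wide_gap": re.search(r"\S(?: {2,}|\t)\S", line) is not None,
        # At least three cells when splitting on wide gaps.
        "many_columns": len(re.split(r" {2,}|\t", stripped)) >= 3,
        # At least 30% of the non-space characters are digits (assumed threshold).
        "digit_heavy": non_space > 0 and digits / non_space >= 0.3,
        # At most eight whitespace-separated tokens (assumed threshold).
        "short_line": len(line.split()) <= 8,
    }


class BernoulliNaiveBayes:
    """Naive Bayes over binary features with Laplace (add-one) smoothing."""

    def fit(self, feature_dicts, labels):
        self.classes = sorted(set(labels))
        self.names = sorted(feature_dicts[0])
        self.log_prior = {}
        self.p_true = {}
        for c in self.classes:
            rows = [f for f, y in zip(feature_dicts, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(labels))
            # P(feature is true | class), smoothed so it is never 0 or 1.
            self.p_true[c] = {
                name: (sum(r[name] for r in rows) + 1) / (len(rows) + 2)
                for name in self.names
            }
        return self

    def predict(self, features):
        def log_score(c):
            score = self.log_prior[c]
            for name in self.names:
                p = self.p_true[c][name]
                score += math.log(p if features[name] else 1.0 - p)
            return score

        return max(self.classes, key=log_score)


def train_toy_model():
    """Fit the model on a tiny hand-made sample (illustrative data only)."""
    table_rows = [
        "Name      Age   City",
        "Li        34    Beijing",
        "Tang      29    Shanghai",
        "2004   1200   35.5",
    ]
    prose = [
        "Table extraction has applications in information retrieval and text mining.",
        "We concentrate particularly on documents written in plain text.",
        "The proposed methods outperform the baseline in our experiments.",
        "Existing methods have mainly focused on web pages.",
    ]
    lines = table_rows + prose
    labels = ["table"] * len(table_rows) + ["text"] * len(prose)
    return BernoulliNaiveBayes().fit([line_features(l) for l in lines], labels)


model = train_toy_model()
print(model.predict(line_features("Song      41    Tianjin")))
```

Under this sketch, consecutive lines labeled `table` could then be grouped into candidate table blocks, which roughly mirrors the abstract's split into block detection and row identification. The grouping rule is likewise an assumption, not something the abstract specifies.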