An OpenCV-based Framework for Table Information Extraction
Jiayi Yuan, Hongye Li, Meng Wang, Ruyang Liu, Chuanyou Li, Beilun Wang
DOI: https://doi.org/10.1109/icbk50248.2020.00093
2020-08-01
Abstract: Portable Document Format (PDF), one of the most popular file formats, is especially useful for educational documents such as textbooks, articles, and papers, because it preserves the original graphic appearance and is convenient to share online. Detecting and extracting information from tables in PDF files can provide a wealth of structured data for constructing educational knowledge graphs. However, most existing methods rely on PDF parsing tools and natural language processing techniques, which generally require training samples and handle cross-page tables poorly. In this paper, we propose a novel OpenCV-based framework to extract the metadata and specific values from PDF tables. Specifically, we first highlight the visual outline of the tables. Then, we locate the tables using horizontal and vertical lines and obtain the coordinates of the tabular frames on each PDF page. Once the tables are detected, we check each table for cross-page scenarios and use an Optical Character Recognition (OCR) engine to extract the specific values in each table cell. Unlike machine-learning-based methods, the proposed method extracts table information accurately without labeled data. We conduct extensive experiments on real-world PDF files. The results demonstrate that our approach deals effectively with cross-page tables and needs only 6.12 seconds on average to process a table.
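The abstract does not give implementation details for the line-based localization step, so the following is only a minimal sketch of the general idea: find long horizontal and vertical runs of ink on a binarized page and take their bounding box as the tabular frame. It is written in plain Python rather than OpenCV so that it runs without dependencies; with OpenCV, a common way to get the same effect is morphological opening with long, thin rectangular kernels (`cv2.getStructuringElement` with `cv2.MORPH_RECT`, then `cv2.morphologyEx` with `cv2.MORPH_OPEN`). All names here (`horizontal_lines`, `vertical_lines`, `table_frame`, `demo_page`) and the `min_len` threshold are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # 1 = ink pixel, 0 = background (an already binarized page)


def horizontal_lines(grid: Grid, min_len: int) -> List[Tuple[int, int, int]]:
    """Return (row, col_start, col_end) for each ink run at least min_len pixels long."""
    lines = []
    for r, row in enumerate(grid):
        start = None
        for c, px in enumerate(list(row) + [0]):  # trailing 0 closes a run at the row end
            if px and start is None:
                start = c
            elif not px and start is not None:
                if c - start >= min_len:
                    lines.append((r, start, c - 1))
                start = None
    return lines


def vertical_lines(grid: Grid, min_len: int) -> List[Tuple[int, int, int]]:
    """Return (col, row_start, row_end) by scanning the transposed grid."""
    transposed = [list(col) for col in zip(*grid)]
    return horizontal_lines(transposed, min_len)


def table_frame(grid: Grid, min_len: int) -> Optional[Tuple[int, int, int, int]]:
    """Bounding box (top, left, bottom, right) of the ruling lines, or None if no table."""
    h = horizontal_lines(grid, min_len)
    v = vertical_lines(grid, min_len)
    if not h or not v:  # assumption: a table needs both kinds of ruling lines
        return None
    top = min([r for r, _, _ in h] + [r0 for _, r0, _ in v])
    bottom = max([r for r, _, _ in h] + [r1 for _, _, r1 in v])
    left = min([c0 for _, c0, _ in h] + [c for c, _, _ in v])
    right = max([c1 for _, _, c1 in h] + [c for c, _, _ in v])
    return (top, left, bottom, right)


def demo_page() -> Grid:
    """A toy 7x9 page holding a 2x2-cell table plus one short 'text' stroke."""
    page = [[0] * 9 for _ in range(7)]
    for r in (1, 3, 5):          # three horizontal rulings
        for c in range(1, 8):
            page[r][c] = 1
    for c in (1, 4, 7):          # three vertical rulings
        for r in range(1, 6):
            page[r][c] = 1
    page[2][2] = 1               # short stroke standing in for cell text
    return page


if __name__ == "__main__":
    print(table_frame(demo_page(), min_len=4))
```

The length threshold is what separates ruling lines from text strokes: the short stroke in the toy page never forms a run of four pixels, so it is ignored. The cell boxes between adjacent rulings could then be cropped and passed to an OCR engine, which is the role the abstract assigns to OCR.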