Page Layout Analysis for Refining Table Extraction from PDF Documents

Andrey Mikhailov, Alexey Shigarov
DOI: https://doi.org/10.1109/ispras53967.2021.00021
2021-12-01
Abstract: PDF is perhaps the most popular format for sharing non-editable documents. PDF documents are often untagged. In particular, this means that the positions and cell structure of tables are not designated explicitly. PDF table detection predicts bounding boxes of tables on document pages. Some of these predictions are inevitably false, which degrades the accuracy of table structure recognition. We argue that page layout analysis in pre- and post-processing can refine table detection. We suggest pre-processing algorithms for recognizing headings, running titles, paragraphs, and images in PDF pages. This allows selecting areas of interest inside pages where real tables can be located. We then use deep neural networks to predict tables only in these areas. We also propose post-processing algorithms to verify predictions and filter out false table candidates after table detection. Our empirical study shows that the proposed approach reduces errors in table detection and improves PDF table extraction overall.
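The post-processing idea of filtering out false table candidates can be illustrated with a minimal sketch. It is not the authors' algorithm: the `Box` type, the `filter_table_candidates` function, and the `max_overlap` threshold are all hypothetical names and values chosen for illustration. The sketch assumes that layout analysis has already produced bounding boxes for non-table blocks (paragraphs, headings, images) and drops any table candidate that is mostly covered by one of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two boxes; 0.0 if they do not overlap."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(0.0, w) * max(0.0, h)


def filter_table_candidates(candidates, non_table_blocks, max_overlap=0.5):
    """Keep candidates that are not mostly covered by a known non-table block.

    max_overlap is an assumed threshold: the largest fraction of a
    candidate's area that a single non-table block may cover.
    """
    kept = []
    for cand in candidates:
        if cand.area == 0:
            continue  # degenerate prediction, nothing to verify
        covered = max(
            (intersection_area(cand, block) / cand.area
             for block in non_table_blocks),
            default=0.0,
        )
        if covered <= max_overlap:
            kept.append(cand)
    return kept


# Usage: one paragraph block and three predicted table boxes.
paragraph = Box(0, 0, 100, 50)
inside_paragraph = Box(10, 10, 90, 40)   # fully inside the paragraph
below_paragraph = Box(0, 60, 100, 120)   # no overlap
touching = Box(0, 40, 100, 80)           # a quarter of it overlaps
tables = filter_table_candidates(
    [inside_paragraph, below_paragraph, touching], [paragraph]
)
```

A real system would likely combine several such checks; the single overlap test here only shows where a verification step fits after detection.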