Abstract: Extracting information from documents that contain quantitative data in tabular form is an important but still unsolved task because document layouts are highly heterogeneous. This paper takes a step toward a solution by proposing a flexible, hybrid table extraction system that consists of a deep learning-based table detection module, a heuristic-based structure recognition method, and a graph-based semantic interpretation component. The system is modular, supports the most frequent table layouts, and handles both documents in image format and PDF files with embedded text. It outperforms the baseline method and achieves results on par with state-of-the-art approaches on the challenging benchmarks from the ICDAR 2013 and ICDAR 2019 table interpretation competitions. We also correct an issue with the evaluation script used in the latter competition and report extended results of the proposed method in comparison with a leading commercial product. Finally, the system achieves a high F score in the scenario where raw documents are given as input and the targeted information is contained in a subset of table columns. The system has been evaluated on general-purpose data and biomedical benchmarks. We intend to continue improving our approach and to process data from other domains, e.g., financial documents. To support future research on information extraction from documents, we make the evaluation scripts and results from our experiments publicly available at https://github.com/mnamysl/tabrec-sncs.

Design of an end-to-end method to extract information from tables

A text-based analysis approach to representing the design selection process

A framework for information extraction from tables in biomedical literature

Text-to-Table: A New Way of Information Extraction

Extracting information from WEB tables based on abstract semantic model

A fully automated approach to a complete Semantic Table Interpretation

TEXUS: Table Extraction System for PDF Documents

Flexible Hybrid Table Recognition and Semantic Interpretation System

Extracting Web table information in cooperative learning activities based on abstract semantic model

An OpenCV-based Framework for Table Information Extraction

TABLEIE: Capturing the Interactions among Sub-Tasks in Information Extraction Via Double Tables

Table understanding in structured documents

TableLab: An Interactive Table Extraction System with Adaptive Deep Learning

Untagged Table Extraction in Semi-structured Documents

Schema-Driven Information Extraction from Heterogeneous Tables

Page Layout Analysis for Refining Table Extraction from PDF Documents

Evaluation of Table Representations to Answer Questions from Tables in Documents : A Case Study using 3GPP Specifications

A Conglomerate of Multiple OCR Table Detection and Extraction

Table Structure Recognition using Top-Down and Bottom-Up Cues

On methods and tools of table detection, extraction and annotation in PDF documents

Automating the extraction of data from HTML tables with unknown structure