Abstract: Extracting information from documents containing quantitative data in tabular format is an important but still unsolved task due to the heterogeneity of document layouts. This work takes a step toward solving this problem. We propose a flexible, hybrid table extraction system consisting of a deep learning-based table detection module, a heuristic-based structure recognition method, and a graph-based semantic interpretation component. The proposed system is modular and supports the most frequent table layouts. Moreover, it handles both documents in image format and PDF files with embedded text. The proposed system outperforms the baseline method and achieves results on par with state-of-the-art approaches on the challenging benchmarks from the ICDAR 2013 and ICDAR 2019 table interpretation competitions. In addition, we correct an issue with the evaluation script used in the latter competition and report extended results of the proposed method in comparison with a leading commercial product. Finally, our table extraction system achieves a high F score in the scenario where raw documents are given as input and the targeted information is contained in a subset of table columns. The system has already been evaluated on general-purpose data and biomedical benchmarks. We intend to continuously improve our approach and to process data from other domains, e.g., financial documents. To support future research on information extraction from documents, we make the evaluation scripts and results from our experiments publicly available at https://github.com/mnamysl/tabrec-sncs.

Table Detection from Plain Text Using Machine Learning and Document Structure

CNN Based Page Object Detection in Document Images

Table understanding in structured documents

Untagged Table Extraction in Semi-structured Documents

Table Structure Recognition using Top-Down and Bottom-Up Cues

An OpenCV-based Framework for Table Information Extraction

UTTSR: A Novel Non-Structured Text Table Recognition Model Powered by Deep Learning Technology

Table detection in business document images by message passing networks

TDeLTA: A Light-weight and Robust Table Detection Method based on Learning Text Arrangement

Automatic Table Boundary Detection and Performance Evaluation in Fixed-Layout Documents

A Table Detection Method for PDF Documents Based on Convolutional Neural Networks

Deep Structured Feature Networks for Table Detection and Tabular Data Extraction from Scanned Financial Document Images

Table Structure Extraction with Bi-directional Gated Recurrent Unit Networks

Table detection in online ink notes

Detecting Table Region in PDF Documents Using Distant Supervision

Flexible Hybrid Table Recognition and Semantic Interpretation System

Complicated Table Structure Recognition

PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction

Robust Table Detection and Structure Recognition from Heterogeneous Document Images

A human-machine method for web table understanding

Table Header Detection and Classification