A Two-Phase Approach for Recognizing Tables with Complex Structures

Huichao Li, Ling-Ze Meng, Weiyu Zhang, Jianing Zhang, Ju Fan, Meihui Zhang
DOI: https://doi.org/10.1007/978-3-031-00123-9_47
2022-01-01
Abstract: Tables contain rich multi-dimensional information that can be an important source for many data analytics applications. However, table structure information is often unavailable in digitized documents such as PDF or image files, making it hard to perform automatic analysis over high-quality table data. Table structure recognition from digitized files is a non-trivial task, as table layouts often vary greatly across files. Moreover, the existence of spanning cells further complicates the table structure and poses significant challenges for table structure recognition. In this paper, we model the problem as a cell relation extraction task and propose T2, a novel two-phase approach that effectively recognizes table structures from digitized documents. T2 introduces a general concept termed prime relation, which captures the direct relations of cells with high confidence. It further constructs an alignment graph and employs a message passing network to discover complex table structures. We validate our approach via extensive experiments over three benchmark datasets. The results demonstrate that T2 is highly robust for recognizing complex table structures.