Split, embed and merge: An accurate table structure recognizer

Zhenrong Zhang, Jianshu Zhang, Jun Du
DOI: https://doi.org/10.48550/arXiv.2107.05214
2022-01-30
Abstract: Table structure recognition is an essential part of making machines understand tables. Its main task is to recognize the internal structure of a table. However, due to the complexity and diversity of table structures and styles, it is very difficult to parse tabular data into a structured format that machines can understand easily, especially for complex tables. In this paper, we introduce Split, Embed and Merge (SEM), an accurate table structure recognizer. Our model takes table images as input and can correctly recognize the structure of tables, whether they are simple or complex. SEM is mainly composed of three parts: splitter, embedder and merger. In the first stage, we apply the splitter to predict the potential regions of the table row (column) separators, and obtain the fine grid structure of the table. In the second stage, by taking full consideration of the textual information in the table, we fuse the output features for each table grid from both vision and language modalities. Moreover, we achieve a higher precision in our experiments by adding additional semantic features. Finally, we process the merging of these basic table grids in a self-regression manner. The corresponding merging results are learned through the attention mechanism. In our experiments, SEM achieves an average F1-Measure of 97.11% on the SciTSR dataset, which outperforms other methods by a large margin. We also won first place in the complex table category and third place in the all tables category in the ICDAR 2021 Competition on Scientific Literature Parsing, Task-B. Extensive experiments on other publicly available datasets demonstrate that our model achieves state-of-the-art performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the complexity and diversity of tables in Table Structure Recognition (TSR), particularly the parsing of complex tables. Specifically:

1. **Difficulty in parsing complex table structures**: Traditional table structure recognition methods mainly focus on simple tables and struggle to accurately parse the internal structure of complex tables containing spanning cells. Spanning cells usually carry important semantic information, such as table headers, which is crucial for understanding the table content.
2. **Insufficient multi-modal information fusion**: Most existing table structure recognition methods rely only on visual features and ignore the rich text information in the tables. This leads to low recognition accuracy on tables that are visually ambiguous.
3. **Poor adaptability to scanned documents**: Many existing methods rely on PDF metadata or OCR models to extract low-level layout features, so they perform poorly on scanned documents, especially when facing diverse table layouts and text organizations.

To solve these problems, the paper proposes the Split, Embed and Merge (SEM) model, which aims to improve the accuracy of table structure recognition in the following ways:

- **Split**: Use a fully convolutional network (FCN) to predict the potential areas of table row/column separators, thereby obtaining the fine-grained grid structure of the table.
- **Embed**: Design a Vision Module and a Text Module to extract the visual and text features of each table grid respectively, and fuse the two through a Blender Module to make full use of multi-modal information.
- **Merge**: Adopt a gated recurrent unit (GRU) decoder with an attention mechanism to predict, step by step, which basic table grids should be merged to restore table cells and finally obtain the complete table structure.
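As a rough illustration of the data flow around the Split and Merge stages, the sketch below turns separator flags into a basic grid and then converts merge maps into cells with row and column spans. It is not the paper's implementation: the learned FCN and GRU decoder are omitted, their outputs are stood in for by hand-written 0/1 lists, and the function names `runs_of_zero`, `build_grid`, and `cells_from_merge_maps`, as well as the one-flag-per-pixel-row/column layout, are assumptions made for this example.

```python
def runs_of_zero(flags):
    """Return half-open (start, end) spans where flags are 0 (non-separator pixels)."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag == 0 and start is None:
            start = i                      # a content run begins
        elif flag != 0 and start is not None:
            spans.append((start, i))       # a separator ends the run
            start = None
    if start is not None:
        spans.append((start, len(flags)))  # run reaches the image border
    return spans


def build_grid(row_sep, col_sep):
    """Build the basic grid as boxes (top, left, bottom, right), one per row/column pair.

    row_sep and col_sep are hypothetical splitter outputs: 1 marks a pixel row
    (or column) that belongs to a separator region, 0 marks content.
    """
    rows = runs_of_zero(row_sep)
    cols = runs_of_zero(col_sep)
    return [[(r0, c0, r1, c1) for (c0, c1) in cols] for (r0, r1) in rows]


def cells_from_merge_maps(merge_maps):
    """Convert merge maps into cells described by inclusive grid index ranges.

    Each merge map is a 0/1 matrix over the grid; the 1 entries are the basic
    grids assumed to belong to one table cell.
    """
    cells = []
    for merge_map in merge_maps:
        coords = [(r, c)
                  for r, row in enumerate(merge_map)
                  for c, value in enumerate(row) if value]
        if not coords:
            continue                       # skip empty maps
        row_ids = [r for r, _ in coords]
        col_ids = [c for _, c in coords]
        cells.append({
            "start_row": min(row_ids), "end_row": max(row_ids),
            "start_col": min(col_ids), "end_col": max(col_ids),
        })
    return cells


# Toy example: a 7x7 pixel table with two content rows and three content columns.
grid = build_grid(row_sep=[1, 0, 0, 1, 0, 0, 1], col_sep=[0, 0, 1, 0, 0, 1, 0])

# One merge map: a header spanning the first two columns of the top row.
header = cells_from_merge_maps([[[1, 1, 0], [0, 0, 0]]])
```

In this toy case the grid has two rows and three columns, and the single merge map yields one spanning cell covering columns 0 to 1 of row 0.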
Through these innovations, SEM can not only handle simple tables but also effectively parse complex tables, and can directly operate on table images without relying on metadata or OCR. Experimental results show that SEM outperforms other methods on multiple public datasets, with its clearest advantage in the recognition of complex tables.
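For readers unfamiliar with the reported metric, the F1-Measure combines precision and recall over the predicted structure. The sketch below assumes, for illustration only, that a table structure is represented as a set of adjacency relations between cells (tuples of two cell identifiers and a direction); the function name `f1_measure` and the tuple layout are hypothetical and are not taken from the paper or from any benchmark's evaluation script.

```python
def f1_measure(predicted, ground_truth):
    """Return (precision, recall, f1) for two collections of hashable relations."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical relations: (cell_a, cell_b, direction), "h" = horizontal, "v" = vertical.
predicted = {("a", "b", "h"), ("b", "c", "h"), ("a", "d", "v")}
ground_truth = {("a", "b", "h"), ("a", "d", "v"), ("d", "e", "h"), ("b", "e", "v")}

precision, recall, f1 = f1_measure(predicted, ground_truth)
```

Here two of the three predicted relations are correct and two of the four true relations are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.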