Abstract: In the digital era, table structure recognition technology is a critical tool for processing and analyzing large volumes of tabular data. Previous methods primarily focus on visual aspects of table structure recovery but often fail to effectively comprehend the textual semantics within tables, particularly for descriptive textual cells. In this paper, we introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model. UniTabNet employs a "divide-and-conquer" strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure. We further enhance our framework with the Vision Guider, which directs the model's focus towards pertinent areas, thereby boosting prediction accuracy. Additionally, we introduce the Language Guider to refine the model's capability to understand textual semantics in table images. Evaluated on prominent table structure datasets such as PubTabNet, PubTables1M, WTW, and iFLYTAB, UniTabNet achieves new state-of-the-art performance, demonstrating the efficacy of our approach. The code will also be made publicly available.

### What problems does this paper attempt to solve?

This paper aims to solve the key challenges in table structure recognition (TSR), especially when dealing with tables containing rich text information. Specifically, the research focuses on the following issues:

1. **Combination of visual and text information**:
   - Traditional methods mainly rely on the visual features of the table to recover the table structure, but often ignore the textual semantics of the table cells, especially descriptive text content.
   - The paper proposes a new framework, UniTabNet, which improves the accuracy of table structure recognition by combining image and text information.
2. **Comprehensive analysis of table structure**:
   - A table contains not only visual layout information, but also logical properties (such as row and column spans) and physical properties (such as the bounding box coordinates of cells). Existing methods can usually analyze these properties only partially.
   - UniTabNet analyzes the table structure comprehensively by using a logical decoder and a physical decoder to predict the logical and physical properties of table cells, respectively.
3. **Improving prediction accuracy**:
   - The Vision Guider module is introduced to guide the model to focus on the key areas in the table (such as rows and columns) to improve the prediction accuracy.
   - The Language Guider module is introduced to enhance the model's ability to understand the text content in the table, especially when dealing with descriptive tables.
4. **Dealing with complex scenarios**:
   - Existing methods perform poorly on tables in complex scenarios (such as wireless tables or tables with a large number of empty cells).
   - UniTabNet achieves state-of-the-art performance on multiple public datasets through an improved architecture and training strategy, which demonstrates its robustness in complex scenarios.
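The split between logical and physical cell properties can be made concrete with a small data structure. The following is a minimal sketch, not UniTabNet's implementation: the `TableCell` class, its field names, and the `build_grid` helper are hypothetical, and representing the physical property as a four-point polygon is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TableCell:
    # Logical properties: grid position and spans (hypothetical field names).
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1
    # Physical property: corner points (x, y) of the cell region (assumed 4 points).
    polygon: Tuple[Tuple[float, float], ...] = ()


def build_grid(cells: List[TableCell]) -> List[List[Optional[int]]]:
    """Map every grid slot to the index of the cell covering it (None if uncovered).

    Assumes `cells` is non-empty.
    """
    n_rows = max(c.row + c.row_span for c in cells)
    n_cols = max(c.col + c.col_span for c in cells)
    grid: List[List[Optional[int]]] = [[None] * n_cols for _ in range(n_rows)]
    for idx, c in enumerate(cells):
        for r in range(c.row, c.row + c.row_span):
            for k in range(c.col, c.col + c.col_span):
                grid[r][k] = idx
    return grid


# A 2x2 table whose header cell spans both columns.
cells = [
    TableCell(0, 0, col_span=2, polygon=((0, 0), (200, 0), (200, 30), (0, 30))),
    TableCell(1, 0, polygon=((0, 30), (100, 30), (100, 60), (0, 60))),
    TableCell(1, 1, polygon=((100, 30), (200, 30), (200, 60), (100, 60))),
]
grid = build_grid(cells)
print(grid)  # [[0, 0], [1, 2]]
```

Once each cell carries both kinds of attributes, the logical fields alone are enough to rebuild the grid layout, while the polygons locate each cell in the image.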
### Formula display

- **Polygon Regression Loss**:

  \[ L_{\text{poly}} = \frac{1}{8}\sum_{j=1}^{8}\left(E(p_j) - p_j^*\right)^2 \]

  where \(E(p_j)\) is the expected position of the predicted polygon coordinate point \(p_j\), and \(p_j^*\) is the ground-truth label.

- **Logical Decoder Prediction**:

  \[ h_{l_{\text{row}}}^i = \text{Linear}(h_i) \]

  \[ a_{l_{\text{row}}} = h_{l_{\text{row}}}^i \cdot \text{Loc}^\top \]

  \[ l_{\text{row}} = \arg\max(a_{l_{\text{row}}}) \]

- **Total Loss with Uncertainty**:

  \[ L_{\text{total}} = \sum_{k=1}^{5}\left(\frac{1}{2\sigma_k^2}L_k + \log(1 + \sigma_k^2)\right) \]

  where \(L_k\) represents five different loss terms, and \(\sigma_k\) is a learnable parameter used to adaptively adjust the weights of each loss term.

Through these improvements, UniTabNet has achieved significant performance improvement in the table structure recognition task, especially when dealing with descriptive tables.
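The three formulas above can be checked numerically with a few lines of plain Python. This is a sketch under stated assumptions, not the authors' code: all function names are hypothetical, \(E(p_j)\) is assumed to be the probability-weighted mean over discrete coordinate bins, and \(\text{Loc}\) is assumed to be a matrix with one embedding row per candidate row index.

```python
import math


def expected_coordinate(probs, bins):
    """Assumed form of E(p_j): probability-weighted mean of candidate positions."""
    return sum(p * b for p, b in zip(probs, bins))


def polygon_regression_loss(coord_probs, bins, targets):
    """L_poly = (1/8) * sum_j (E(p_j) - p_j*)^2 over the 8 polygon coordinates."""
    assert len(coord_probs) == len(targets) == 8
    return sum(
        (expected_coordinate(p, bins) - t) ** 2
        for p, t in zip(coord_probs, targets)
    ) / 8.0


def linear(h, weight, bias):
    """Affine map y = W h + b for a single vector h."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]


def predict_row_index(h_i, weight, bias, loc):
    """h_row = Linear(h_i); a = h_row . Loc^T; l_row = argmax(a)."""
    h_row = linear(h_i, weight, bias)
    scores = [sum(a * b for a, b in zip(h_row, loc_k)) for loc_k in loc]
    return max(range(len(scores)), key=scores.__getitem__)


def total_loss(losses, sigmas):
    """L_total = sum_k ( L_k / (2 sigma_k^2) + log(1 + sigma_k^2) )."""
    return sum(
        l / (2.0 * s ** 2) + math.log(1.0 + s ** 2)
        for l, s in zip(losses, sigmas)
    )


# Toy inputs (made up for illustration only).
bins = [0.0, 1.0, 2.0, 3.0]
one_hot_at_2 = [0.0, 0.0, 1.0, 0.0]
poly_loss = polygon_regression_loss([one_hot_at_2] * 8, bins, [2.0] * 8)

row = predict_row_index(
    [1.0, 0.0],                            # hidden state h_i
    [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],  # identity weight, zero bias
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],  # three candidate location embeddings
)

total = total_loss([1.0] * 5, [1.0] * 5)
print(poly_loss, row, total)
```

In the toy run, a prediction that puts all probability mass on the correct bin gives a polygon loss of 0.0, and the hidden state aligns best with the second location embedding, so the predicted row index is 1. In the total loss, a larger \(\sigma_k\) down-weights \(L_k\) while the \(\log(1 + \sigma_k^2)\) term penalizes letting \(\sigma_k\) grow without bound.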

UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition
