UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition

Zhenrong Zhang, Shuhang Liu, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Yu Hu
2024-09-20
Abstract: In the digital era, table structure recognition technology is a critical tool for processing and analyzing large volumes of tabular data. Previous methods primarily focus on visual aspects of table structure recovery but often fail to effectively comprehend the textual semantics within tables, particularly for descriptive textual cells. In this paper, we introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model. UniTabNet employs a "divide-and-conquer" strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure. We further enhance our framework with the Vision Guider, which directs the model's focus towards pertinent areas, thereby boosting prediction accuracy. Additionally, we introduce the Language Guider to refine the model's capability to understand textual semantics in table images. Evaluated on prominent table structure datasets such as PubTabNet, PubTables1M, WTW, and iFLYTAB, UniTabNet achieves a new state-of-the-art performance, demonstrating the efficacy of our approach. The code will also be made publicly available.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper aims to solve the key challenges in table structure recognition (TSR), especially when dealing with tables containing rich text information. Specifically, the research focuses on the following issues:

1. **Combination of visual and text information**:
   - Traditional methods mainly rely on the visual features of the table to recover the table structure, but often ignore the text semantic information in the table cells, especially the descriptive text content.
   - The paper proposes a new framework, UniTabNet, which improves the accuracy of table structure recognition by combining image and text information.

2. **Comprehensive analysis of table structure**:
   - A table contains not only visual layout information, but also logical properties (such as row and column spans) and physical properties (such as the bounding box coordinates of cells). Existing methods can usually only partially analyze these properties.
   - UniTabNet realizes a comprehensive analysis of the table structure by designing a logical decoder and a physical decoder to analyze the logical and physical properties of table cells respectively.

3. **Improving prediction accuracy**:
   - The Vision Guider module is introduced to guide the model to focus on the key areas in the table (such as rows and columns) to improve the prediction accuracy.
   - At the same time, the Language Guider module is introduced to enhance the model's ability to understand the text content in the table, especially when dealing with descriptive tables.

4. **Dealing with complex scenarios**:
   - Existing methods perform poorly when dealing with tables in complex scenarios (such as wireless tables or tables with a large number of empty cells).
   - UniTabNet has achieved state-of-the-art performance on multiple public datasets through an improved architecture and training strategy, which proves its robustness in complex scenarios.
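To make the logical/physical distinction in item 2 concrete, the following is a minimal sketch of how a parsed cell might be represented once both decoders have produced their outputs. The `TableCell` class, its field names, and the `build_grid` helper are hypothetical and chosen for illustration only; they are not taken from the UniTabNet code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TableCell:
    # Physical attribute: four corner points flattened to 8 numbers
    # (x1, y1, x2, y2, x3, y3, x4, y4). Hypothetical layout.
    polygon: Tuple[float, ...]
    # Logical attributes: top-left grid position plus spans.
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1


def build_grid(cells: List[TableCell], n_rows: int, n_cols: int) -> List[List[Optional[int]]]:
    """Map every grid slot to the index of the cell that covers it (None if empty)."""
    grid: List[List[Optional[int]]] = [[None] * n_cols for _ in range(n_rows)]
    for idx, cell in enumerate(cells):
        for r in range(cell.row, cell.row + cell.row_span):
            for c in range(cell.col, cell.col + cell.col_span):
                if grid[r][c] is not None:
                    raise ValueError(f"cells {grid[r][c]} and {idx} overlap at ({r}, {c})")
                grid[r][c] = idx
    return grid


# A 2x2 table whose header cell spans both columns.
cells = [
    TableCell(polygon=(0, 0, 100, 0, 100, 20, 0, 20), row=0, col=0, col_span=2),
    TableCell(polygon=(0, 20, 50, 20, 50, 40, 0, 40), row=1, col=0),
    TableCell(polygon=(50, 20, 100, 20, 100, 40, 50, 40), row=1, col=1),
]
grid = build_grid(cells, n_rows=2, n_cols=2)
```

Keeping the polygon separate from the grid indices mirrors the split between the physical and logical decoders: the grid can be rebuilt from the logical fields alone, while the polygon locates the cell in the image.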
### Formula display

- **Polygon Regression Loss**:

  \[ L_{\text{poly}} = \frac{1}{8}\sum_{j=1}^{8}\left(E(p_j) - p_j^*\right)^2 \]

  where \(E(p_j)\) is the expected position of the predicted polygon coordinate point \(p_j\), and \(p_j^*\) is the ground-truth label.

- **Logical Decoder Prediction**:

  \[ h_{l_{\text{row}}}^i = \text{Linear}(h_i) \]

  \[ a_{l_{\text{row}}} = h_{l_{\text{row}}}^i \cdot \text{Loc}^\top \]

  \[ l_{\text{row}} = \arg\max(a_{l_{\text{row}}}) \]

- **Total Loss with Uncertainty**:

  \[ L_{\text{total}} = \sum_{k=1}^{5}\left(\frac{1}{2\sigma_k^2}L_k + \log(1 + \sigma_k^2)\right) \]

  where \(L_k\) represents five different loss terms, and \(\sigma_k\) is a learnable parameter used to adaptively adjust the weights of each loss term.

Through these improvements, UniTabNet has achieved significant performance improvement in the table structure recognition task, especially when dealing with descriptive tables.
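The three formulas above can be sketched in plain Python as follows. This is an illustrative reading rather than the authors' implementation, and it rests on several assumptions: \(E(p_j)\) is computed as the expectation of a discrete distribution over coordinate bins, `Loc` is treated as a matrix with one embedding row per candidate index, and all function and parameter names are hypothetical.

```python
import math


def polygon_regression_loss(coord_probs, targets):
    """L_poly = (1/8) * sum_j (E(p_j) - p_j*)^2.

    coord_probs[j][v] is the predicted probability that coordinate j falls
    in bin v (an assumption about how E(p_j) is obtained).
    """
    assert len(coord_probs) == len(targets) == 8
    total = 0.0
    for probs, target in zip(coord_probs, targets):
        expected = sum(v * p for v, p in enumerate(probs))  # E(p_j)
        total += (expected - target) ** 2
    return total / 8


def predict_logical_index(h, weight, bias, loc):
    """h_l = Linear(h); a = h_l . Loc^T; return argmax(a)."""
    # Linear layer: one output per row of `weight`.
    h_l = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]
    # One score per row of `loc` (dot product with each candidate embedding).
    scores = [sum(a * e for a, e in zip(h_l, emb)) for emb in loc]
    return max(range(len(scores)), key=scores.__getitem__)


def total_loss(losses, sigmas):
    """L_total = sum_k ( L_k / (2 * sigma_k^2) + log(1 + sigma_k^2) )."""
    assert len(losses) == len(sigmas)
    return sum(l / (2 * s ** 2) + math.log(1 + s ** 2) for l, s in zip(losses, sigmas))
```

In the total loss, a larger \(\sigma_k\) shrinks the weight \(1/(2\sigma_k^2)\) on \(L_k\), while the \(\log(1 + \sigma_k^2)\) term penalizes letting \(\sigma_k\) grow without bound. In training, each \(\sigma_k\) would be a learnable parameter rather than a fixed number as in this sketch.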