Abstract: To overcome the limitations and challenges of current automatic table data annotation methods and random table data synthesis approaches, we propose a novel method for synthesizing annotation data specifically designed for table recognition. This method utilizes the structure and content of existing complex tables, facilitating the efficient creation of tables that closely replicate the authentic styles found in the target domain. By leveraging the actual structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset in this domain. We used this dataset to train several recent deep learning-based end-to-end table recognition models. Additionally, we have established the inaugural benchmark for real-world complex tables in the Chinese financial announcement domain, using it to assess the performance of models trained on our synthetic data, thereby effectively validating our method's practicality and effectiveness. Furthermore, we applied our synthesis method to augment the FinTabNet dataset, extracted from English financial announcements, by increasing the proportion of tables with multiple spanning cells to introduce greater complexity. Our experiments show that models trained on this augmented dataset achieve comprehensive improvements in performance, especially in the recognition of tables with multiple spanning cells.

What problem does this paper attempt to address?

### The Problem Addressed by This Paper

This paper aims to address the limitations and challenges of current automatic table data annotation methods and random table data synthesis methods. Specifically, the paper proposes a new method to generate annotated data specifically for table recognition. This method leverages the structure and content of existing complex tables to efficiently create tables that closely resemble the real styles in the target domain.

#### Main Problems and Solutions

1. **Lack of large-scale high-quality annotated datasets**: Most current table recognition methods rely on large-scale, high-quality annotated datasets for model training. However, existing public datasets are often not comprehensive and detailed enough, hindering the development of table structure recognition technology.
   - **Solution**: A method is proposed to generate high-fidelity synthetic datasets by utilizing the structure and content of existing complex tables.
2. **Errors in automatically annotated datasets**: Existing automatically annotated datasets often contain a large number of annotation errors. For example, the FinTabNet dataset contains about 9% obvious annotation errors.
   - **Solution**: By analyzing the distribution of actual tables, extracting style features, and generating more realistic table styles, annotation errors are reduced.
3. **Diversity of table styles in different domains and languages**: Table styles vary greatly across different domains and languages, and existing datasets cannot adapt to a wide range of application scenarios.
   - **Solution**: By utilizing the actual table structure and content of specific domains, synthetic datasets that conform to the characteristics of those domains are generated. For example, tables from Chinese financial announcements are used to generate the first large-scale Chinese financial announcement table annotation dataset.
4. **Enhancing the complexity of existing datasets**: To improve the model's performance on complex tables, the paper also proposes a method to enhance the complexity of existing datasets (such as FinTabNet) by increasing the proportion of multi-span cell tables.
   - **Solution**: By increasing the number of multi-span cell tables, the model's ability to recognize complex tables is improved.

In summary, this paper mainly addresses the issues of low quality, numerous annotation errors, and the difficulty of adapting to diverse application scenarios in existing table recognition datasets. It proposes a new method to generate high-fidelity synthetic datasets, thereby effectively improving the performance of table recognition models.
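The idea of raising the proportion of multi-span cell tables can be made concrete with a small sketch. It is not the paper's synthesis pipeline: it only shows how one might detect spanning cells and work out how many additional multi-span tables a dataset would need to reach a target share. It assumes each table's structure is available as an HTML string, and the names `count_spanning_cells`, `is_multi_span`, and `tables_needed`, as well as the threshold of two spanning cells for "multiple", are hypothetical choices made for illustration.

```python
import math
from html.parser import HTMLParser


class SpanCounter(HTMLParser):
    """Counts <td>/<th> cells whose rowspan or colspan is greater than 1."""

    def __init__(self):
        super().__init__()
        self.spanning_cells = 0

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        attr = dict(attrs)
        # Missing or empty span attributes default to 1 (a regular cell).
        rowspan = int(attr.get("rowspan") or 1)
        colspan = int(attr.get("colspan") or 1)
        if rowspan > 1 or colspan > 1:
            self.spanning_cells += 1


def count_spanning_cells(table_html):
    """Return the number of spanning cells in one HTML table string."""
    parser = SpanCounter()
    parser.feed(table_html)
    parser.close()
    return parser.spanning_cells


def is_multi_span(table_html, min_spans=2):
    """Assumption: 'multiple spanning cells' means at least `min_spans`."""
    return count_spanning_cells(table_html) >= min_spans


def tables_needed(total, multi_span, target):
    """Smallest k such that (multi_span + k) / (total + k) >= target.

    k is the number of extra multi-span tables to add to the dataset.
    """
    if not 0 <= target < 1:
        raise ValueError("target must be in [0, 1)")
    if total > 0 and multi_span / total >= target:
        return 0
    return math.ceil((target * total - multi_span) / (1 - target))


if __name__ == "__main__":
    example = (
        "<table><tr><td colspan='2'>Header</td></tr>"
        "<tr><td rowspan='2'>A</td><td>B</td></tr>"
        "<tr><td>C</td></tr></table>"
    )
    print(count_spanning_cells(example), is_multi_span(example))
    print(tables_needed(total=100, multi_span=20, target=0.5))
```

The arithmetic in `tables_needed` follows from solving (m + k) / (n + k) ≥ p for k, which gives k ≥ (p·n − m) / (1 − p). Because every added table is itself multi-span, both the numerator and the denominator grow, which is why reaching a high target share can require adding many tables.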

Synthesizing Realistic Data for Table Recognition
