SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction

Ethan Bradley, Muhammad Roman, Karen Rafferty, Barry Devereux
2024-12-05
Abstract: Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables due to the vast amount of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words, and their positions, contained within the tables, instead relying on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model using real-world financial tables and compare it to a state-of-the-art generative model and discuss the results. We make the dataset, model, and dataset generation code publicly available.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of extracting table information from document images, particularly in the financial domain. It focuses on three issues:

1. **Lack of labeled data**: Existing table extraction datasets concentrate on scientific tables because academic articles and their source code are easy to obtain. Financial tables differ substantially from scientific tables in layout and typography, so these datasets are not well suited to financial table extraction.
2. **Limitations of OCR**: Many existing table extraction datasets rely on optical character recognition (OCR) to obtain the text in a table and its position. OCR is not always accurate on tabular text, especially in complex or irregular table structures, and these errors degrade the performance of downstream tasks.
3. **Need for accurate spatial information**: Training modern machine learning models on natural language processing tasks, especially table visual question answering, requires the exact 2D position of each word in the table. Existing datasets often lack such detailed position annotations.

To address these problems, the authors propose **SynFinTabs**, a large-scale, labeled dataset of synthetic financial tables. Its main characteristics are:

- **Accurate annotation information**: SynFinTabs contains HTML, JSON, and CSV representations of each table, and each word, cell, and row is annotated with its bounding-box position in the image.
- **Diverse table styles**: The dataset covers six themes that simulate the table styles found in corporate annual reports, financial statements, and spreadsheets, which increases the diversity and complexity of the data.
- **Support for multiple tasks**: The dataset can be used to train models for table structure recognition, table detection, and table visual question answering.

In addition, the authors created a layout large language model named **FinTabQA**, trained on an extractive question-answering task over table content, and evaluated it experimentally on real-world financial tables.
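
To make the role of word-level bounding boxes in extractive question answering more concrete, here is a minimal Python sketch. It is illustrative only: the `Word` record, the sample table words, and both helper functions are hypothetical and do not reflect the actual SynFinTabs schema or the FinTabQA code. Rescaling boxes to a 0 to 1000 grid is the convention used by LayoutLM-family models; whether FinTabQA uses the same convention is an assumption here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in image pixels


@dataclass
class Word:
    """Hypothetical word annotation: text plus its pixel bounding box."""
    text: str
    bbox: BBox


def normalize_bbox(bbox: BBox, width: int, height: int, scale: int = 1000) -> BBox:
    """Rescale a pixel bounding box to an integer 0..scale grid."""
    x0, y0, x1, y1 = bbox
    return (
        scale * x0 // width,
        scale * y0 // height,
        scale * x1 // width,
        scale * y1 // height,
    )


def find_answer_span(words: List[Word], answer: str) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) word indices of the first exact match of
    the answer text, or None if the answer does not occur."""
    target = answer.split()
    if not target:
        return None
    texts = [w.text for w in words]
    for start in range(len(texts) - len(target) + 1):
        if texts[start:start + len(target)] == target:
            return start, start + len(target) - 1
    return None


# Made-up words from an imagined 800x400 pixel table image.
words = [
    Word("Revenue", (40, 100, 160, 120)),
    Word("1,250", (600, 100, 680, 120)),
    Word("Cost", (40, 140, 100, 160)),
    Word("of", (104, 140, 124, 160)),
    Word("sales", (128, 140, 190, 160)),
    Word("(730)", (600, 140, 680, 160)),
]

# Layout-aware inputs: one normalized box per word.
boxes = [normalize_bbox(w.bbox, width=800, height=400) for w in words]

# Extractive QA label: the answer is a span of word indices, not free text.
span = find_answer_span(words, "(730)")
```

Because every word carries an exact box, the answer to a question such as "What is the cost of sales?" can be expressed as start and end indices into the word list, with no OCR step needed to recover either the text or its position.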