CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models

Haoxiang Shi, Jiaan Wang, Jiarong Xu, Cen Wang, Tetsuya Sakai
2024-05-21
Abstract: Text-to-Table aims to generate structured tables to convey the key information from unstructured documents. Existing text-to-table datasets are typically English-oriented, limiting research in non-English languages. Meanwhile, the emergence of large language models (LLMs) has shown great success as general task solvers in multi-lingual settings (e.g., ChatGPT), theoretically enabling text-to-table in other languages. In this paper, we propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task. Our preliminary analysis of English text-to-table datasets highlights two key factors for dataset construction: data diversity and data hallucination. Inspired by this, the CT-Eval dataset selects a popular Chinese multidisciplinary online encyclopedia as the source and covers 28 domains to ensure data diversity. To minimize data hallucination, we first train an LLM to judge and filter out the task samples with hallucination, then employ human annotators to clean the hallucinations in the validation and testing sets. After this process, CT-Eval contains 88.6K task samples. Using CT-Eval, we evaluate the performance of open-source and closed-source LLMs. Our results reveal that zero-shot LLMs (including GPT-4) still have a significant performance gap compared with human judgment. Furthermore, after fine-tuning, open-source LLMs can significantly improve their text-to-table ability, outperforming GPT-4 by a large margin. In short, CT-Eval not only helps researchers evaluate and quickly understand the Chinese text-to-table ability of existing LLMs but also serves as a valuable resource to significantly improve the text-to-table performance of LLMs.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the limitations of existing datasets and large language model (LLM) applications for the Chinese text-to-table generation task. Specifically:

1. **Limitations of Datasets**:
   - Existing text-to-table datasets mainly focus on English and offer little support for research in non-English languages.
   - These datasets are usually concentrated in a single domain, such as restaurants, sports, or biographies, and so lack diversity.
   - Some datasets have a relatively high hallucination rate: the reference tables contain information that goes beyond the original document, which harms both model training and evaluation.
2. **Application Challenges of Large Language Models**:
   - Although large language models perform well on a variety of natural language processing tasks, their performance on text-to-table tasks has not been fully explored.
   - In zero-shot and fine-tuning scenarios, there is still a significant gap between the performance of large language models and human judgment.

To solve these problems, the paper proposes a Chinese text-to-table benchmark dataset named CT-Eval, which aims to evaluate and improve the performance of large language models on this task. CT-Eval ensures data quality and diversity in the following ways:

- **Data Sources**: Baidu Baike is selected as the data source, covering 28 different domains to ensure data diversity.
- **Reducing Hallucinations**: A large language model is trained to identify and filter out samples containing hallucinations, and human annotators then further clean the hallucinated information in the validation and test sets.
- **Data Statistics**: CT-Eval contains 88.6K task samples, with an average length of 911.46 Chinese characters, and its hallucination rate is significantly lower than that of other existing datasets.
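The notion of hallucination used here (table content that is not supported by the source document) can be illustrated with a rough string-matching proxy. This is only a minimal sketch under stated assumptions, not the paper's method: the paper trains an LLM judge and relies on human annotators, whereas the code below simply checks whether each table value appears verbatim in the document. The function names `unsupported_values` and `hallucination_rate` and the example document and table are hypothetical.

```python
from typing import List, Tuple

# A table is modelled as a list of (field, value) pairs. This layout is an
# assumption for illustration, not the CT-Eval data format.
Table = List[Tuple[str, str]]


def unsupported_values(document: str, table: Table) -> List[str]:
    """Return the table values that do not appear verbatim in the document."""
    missing = []
    for _field, value in table:
        value = value.strip()
        if value and value not in document:
            missing.append(value)
    return missing


def hallucination_rate(document: str, table: Table) -> float:
    """Fraction of non-empty table values with no verbatim support in the document."""
    values = [value.strip() for _field, value in table if value.strip()]
    if not values:
        return 0.0
    return len(unsupported_values(document, table)) / len(values)


# Hypothetical example: the death year is not stated in the document.
document = "李白,字太白,唐代诗人,出生于701年。"
table = [("姓名", "李白"), ("朝代", "唐代"), ("出生", "701年"), ("逝世", "762年")]

print(unsupported_values(document, table))
print(hallucination_rate(document, table))
```

Exact substring matching misses paraphrased or normalized values, so it would flag some supported cells and is only useful as a coarse first pass; that limitation is one reason a learned judge plus human review, as the paper describes, is the more plausible choice for building a benchmark.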
By conducting zero-shot and fine-tuning experiments on multiple open-source and closed-source large language models, the paper reports the following key findings:

- **Zero-Shot Performance**: GPT-4 performs best in the zero-shot scenario, but there is still an obvious gap compared with human judgment.
- **Fine-Tuning Effects**: After fine-tuning, the performance of open-source large language models improves significantly, even exceeding that of zero-shot GPT-4, which indicates the effectiveness of CT-Eval.
- **Ongoing Challenges**: Even after fine-tuning, hallucinations still appear in the tables generated by large language models, which remains an important challenge in using them for text-to-table tasks.

In conclusion, CT-Eval not only provides researchers with a benchmark dataset for evaluating Chinese text-to-table capabilities, but also serves as a valuable resource for improving the performance of large language models on this task.