Abstract: Text-to-Table aims to generate structured tables that convey the key information from unstructured documents. Existing text-to-table datasets are typically English-oriented, limiting research on non-English languages. Meanwhile, large language models (LLMs) such as ChatGPT have shown great success as general task solvers in multilingual settings, theoretically enabling text-to-table in other languages. In this paper, we propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task. Our preliminary analysis of English text-to-table datasets highlights two key factors for dataset construction: data diversity and data hallucination. Inspired by this, the CT-Eval dataset selects a popular Chinese multidisciplinary online encyclopedia as the source and covers 28 domains to ensure data diversity. To minimize data hallucination, we first train an LLM to judge and filter out the task samples with hallucination, then employ human annotators to clean the hallucinations in the validation and testing sets. After this process, CT-Eval contains 88.6K task samples. Using CT-Eval, we evaluate the performance of open-source and closed-source LLMs. Our results reveal that zero-shot LLMs (including GPT-4) still have a significant performance gap compared with human judgment. Furthermore, after fine-tuning, open-source LLMs can significantly improve their text-to-table ability, outperforming GPT-4 by a large margin. In short, CT-Eval not only helps researchers evaluate and quickly understand the Chinese text-to-table ability of existing LLMs but also serves as a valuable resource for significantly improving the text-to-table performance of LLMs.

What problem does this paper attempt to address?

The paper addresses the limitations of existing datasets and large language model (LLM) applications for the Chinese text-to-table generation task. Specifically:

1. **Limitations of Datasets**:
   - Existing text-to-table datasets mainly focus on English and offer little support for research on non-English languages.
   - These datasets are usually concentrated in a single domain, such as restaurants, sports, or biographies, and so lack diversity.
   - Some datasets have a relatively high hallucination rate, meaning that the reference tables contain information that goes beyond the original document. This degrades both the training and the evaluation of models.
2. **Application Challenges of Large Language Models**:
   - Although LLMs perform well on a variety of natural language processing tasks, their performance on text-to-table tasks has not been fully explored.
   - In zero-shot and fine-tuning scenarios, there is still a significant gap between LLM performance and human judgment.

To solve these problems, the paper proposes a Chinese text-to-table benchmark dataset named CT-Eval, which aims to evaluate and improve the performance of LLMs on this task. CT-Eval ensures data quality and diversity in the following ways:

- **Data Sources**: Baidu Baike is selected as the data source, covering 28 different domains to ensure data diversity.
- **Reducing Hallucinations**: An LLM is trained to identify and filter out samples that contain hallucinations, and human annotators then clean the remaining hallucinated information from the validation and test sets.
- **Data Statistics**: CT-Eval contains 88.6K task samples with an average length of 911.46 Chinese characters, and its hallucination rate is significantly lower than that of other existing datasets.

Through zero-shot and fine-tuning experiments on multiple open-source and closed-source LLMs, the paper reports the following key findings:

- **Zero-Shot Performance**: GPT-4 performs best in the zero-shot scenario, but there is still an obvious gap compared with human judgment.
- **Fine-Tuning Effects**: After fine-tuning, open-source LLMs improve significantly and even exceed zero-shot GPT-4, indicating the effectiveness of CT-Eval.
- **Ongoing Challenges**: Even after fine-tuning, the tables generated by LLMs still contain hallucinations, which remains an important challenge in using LLMs for text-to-table tasks.

In conclusion, CT-Eval not only provides researchers with a benchmark dataset for evaluating Chinese text-to-table capabilities, but also offers a valuable resource for improving the performance of LLMs on this task.
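The notion of hallucination used in this summary (table content that cannot be traced back to the source document) can be made concrete with a simple string-matching check. The following is a minimal sketch under stated assumptions, not the paper's method: CT-Eval trains an LLM to judge hallucination, whereas this example only tests whether each cell value appears verbatim in the document. The function names, the table representation, and the sample data are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Tuple

# Assumed representation: one dict per table row, mapping column header -> cell value.
Table = List[Dict[str, str]]


def unsupported_cells(document: str, table: Table) -> List[str]:
    """Return the cell values that never appear verbatim in the source document."""
    missing = []
    for row in table:
        for value in row.values():
            if value and value not in document:
                missing.append(value)
    return missing


def hallucination_rate(samples: List[Tuple[str, Table]]) -> float:
    """Fraction of (document, table) samples with at least one unsupported cell."""
    if not samples:
        return 0.0
    flagged = sum(1 for doc, table in samples if unsupported_cells(doc, table))
    return flagged / len(samples)


# Hypothetical sample data, invented for illustration only.
doc = "李白，字太白，唐代诗人，出生于701年。"
clean_table = [{"姓名": "李白", "朝代": "唐代", "出生年份": "701年"}]
noisy_table = [{"姓名": "李白", "朝代": "唐代", "逝世年份": "762年"}]

if __name__ == "__main__":
    print(unsupported_cells(doc, clean_table))
    print(unsupported_cells(doc, noisy_table))
    print(hallucination_rate([(doc, clean_table), (doc, noisy_table)]))
```

Exact substring matching is deliberately crude: it would flag a correct cell that paraphrases or normalizes the source text, which is one reason a learned judge followed by human annotation, as the summary describes, is a more plausible choice for building a real dataset.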

CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models
