Abstract: Text-to-Table aims to generate structured tables that convey the key information in unstructured documents. Existing text-to-table datasets are typically English-oriented, limiting research on non-English languages. Meanwhile, large language models (LLMs) such as ChatGPT have shown great success as general task solvers in multilingual settings, theoretically enabling text-to-table in other languages. In this paper, we propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task. Our preliminary analysis of English text-to-table datasets highlights two key factors for dataset construction: data diversity and data hallucination. Inspired by this, CT-Eval uses a popular Chinese multidisciplinary online encyclopedia as its source and covers 28 domains to ensure data diversity. To minimize data hallucination, we first train an LLM to judge and filter out task samples containing hallucinations, then employ human annotators to remove hallucinations from the validation and test sets. After this process, CT-Eval contains 88.6K task samples. Using CT-Eval, we evaluate the performance of open-source and closed-source LLMs. Our results reveal that zero-shot LLMs (including GPT-4) still have a significant performance gap compared with human judgment. Furthermore, after fine-tuning, open-source LLMs significantly improve their text-to-table ability, outperforming GPT-4 by a large margin. In short, CT-Eval not only helps researchers evaluate and quickly understand the Chinese text-to-table ability of existing LLMs but also serves as a valuable resource for significantly improving the text-to-table performance of LLMs.

CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models

Improving Accented Mandarin Speech Recognition by Using Recurrent Neural Network Based Language Model Adaptation

Adaptable and Reliable Text Classification using Large Language Models

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

An Empirical Investigation of Domain Adaptation Ability for Chinese Spelling Check Models

CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models

Full-text Error Correction for Chinese Speech Recognition with Large Language Model