A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

Xuanfan Ni, Piji Li
2024-05-17
Abstract: Recent efforts have evaluated large language models (LLMs) in areas such as commonsense reasoning, mathematical reasoning, and code generation. However, to the best of our knowledge, no work has specifically investigated the performance of LLMs in natural language generation (NLG) tasks, a pivotal criterion for determining model excellence. Thus, this paper conducts a comprehensive evaluation of well-known and high-performing LLMs, namely ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models, in the context of NLG tasks. We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization. Moreover, we propose a common evaluation setting that incorporates input templates and post-processing strategies. Our study reports automatic results, accompanied by a detailed analysis.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the lack of systematic evaluation of large language models (LLMs) on natural language generation (NLG) tasks. Existing research has explored these models' capabilities in areas such as commonsense reasoning, mathematical reasoning, and code generation, but, to the authors' knowledge, no work has specifically evaluated their NLG performance. The paper aims to fill this gap with a comparative analysis of well-known, high-performing LLMs of different architectures and scales: ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models. It selects English and Chinese datasets covering dialogue generation and text summarization, and proposes a common evaluation setting, including input templates and post-processing strategies, to keep the comparison fair and consistent. Through automatic evaluation results and a detailed analysis, the paper hopes to improve the understanding of instruction and prompt design and thereby make better use of these models in NLG tasks.
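To make the idea of a common evaluation setting more concrete, here is a minimal Python sketch of the three pieces mentioned above: an input template, a post-processing step, and an automatic score. Everything in it is an illustrative assumption rather than the paper's actual setup. The template wording, the function names (`build_prompt`, `post_process`, `unigram_f1`), and the choice of a unigram-overlap F1 as the metric are all hypothetical, since the abstract does not specify any of them.

```python
from collections import Counter

# Hypothetical templates: the paper's actual template wording is not given here.
TEMPLATES = {
    "dialogue": "Continue the following dialogue with one response.\n{context}\nResponse:",
    "summarization": "Summarize the following document.\n{document}\nSummary:",
}


def build_prompt(task: str, **fields: str) -> str:
    """Fill the shared template for a task so every model sees the same input."""
    return TEMPLATES[task].format(**fields)


def post_process(prompt: str, raw_output: str) -> str:
    """Assumed post-processing: drop an echoed prompt, keep the first non-empty line."""
    text = raw_output
    if text.startswith(prompt):
        text = text[len(prompt):]
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return ""


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy automatic metric (whitespace-token overlap F1), a stand-in for real NLG metrics."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

In use, the same `build_prompt` and `post_process` would be applied to every model's output before scoring, so that differences in scores reflect the models rather than differences in prompting or output cleanup. Whitespace tokenization is only suitable for the English side of such a sketch; Chinese text would need a different tokenizer.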