Evaluating Large Language Models: A Comprehensive Survey

Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong
2023-11-26
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. In addition to the comprehensive review of the evaluation methodologies and benchmarks on these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability. We hope that this comprehensive overview will stimulate further research interest in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to evaluate large language models (LLMs) across a wide range of tasks so that their development remains safe and beneficial. Specifically, it focuses on the following aspects:
1. **Knowledge and Capability Evaluation**: This section covers the evaluation of LLMs' knowledge and reasoning abilities, including question answering, knowledge completion, and various reasoning tasks (such as commonsense reasoning, logical reasoning, multi-hop reasoning, and mathematical reasoning). Evaluating these capabilities comprehensively helps researchers understand how LLMs perform in practical applications.
2. **Alignment Evaluation**: This section explores the evaluation of LLMs in terms of ethics, bias, toxicity, and truthfulness. Since LLMs may generate harmful or misleading content, it is necessary to rigorously assess whether their outputs align with human values and social norms.
3. **Safety Evaluation**: This section focuses on the robustness and risk aspects of LLMs. As LLMs approach artificial general intelligence (AGI), evaluating their safety becomes particularly important to prevent potential catastrophic risks.
4. **Evaluation of LLMs in Specialized Domains**: The paper also discusses the application of LLMs in specific professional fields, such as biomedicine, education, law, computer science, and finance. Evaluating the models in each of these domains gives a clearer picture of their practical effectiveness.
5. **Construction of a Comprehensive Evaluation Platform**: Finally, the paper emphasizes the importance of building comprehensive evaluation platforms that cover LLMs' capabilities, alignment, safety, and applicability. Such platforms help guide the responsible development of LLMs, maximizing societal benefit while minimizing potential risks.
In summary, the goal of this paper is to provide a panoramic view of LLM evaluation, covering evaluation methodologies and benchmarks, and to stimulate further research interest in this field.