Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. In addition to the comprehensive review of the evaluation methodologies and benchmarks on these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability. We hope that this comprehensive overview will stimulate further research interest in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.

What problem does this paper attempt to address?

The paper aims to address the evaluation issues of large language models (LLMs) across various tasks and ensure the safe and beneficial development of these models. Specifically, the paper focuses on the following aspects:

1. **Knowledge and Capability Evaluation**: This section covers the evaluation of LLMs' knowledge and reasoning abilities, including question answering, knowledge completion, and various reasoning tasks (such as commonsense reasoning, logical reasoning, multi-hop reasoning, and mathematical reasoning). Through comprehensive evaluation of these capabilities, researchers can better understand the performance of LLMs in practical applications.
2. **Alignment Evaluation**: This section explores the evaluation of LLMs in terms of ethics, bias, toxicity, and truthfulness. Since LLMs may generate harmful or misleading content, it is necessary to rigorously assess whether their outputs align with human values and social norms.
3. **Safety Evaluation**: This section focuses on the robustness and risk aspects of LLMs. As LLMs approach artificial general intelligence (AGI), evaluating their safety becomes particularly important to prevent potential catastrophic risks.
4. **Evaluation of LLMs in Specialized Domains**: The paper also discusses the application of LLMs in specific professional fields, such as biomedicine, education, law, computer science, and finance. By evaluating the performance of these models in different domains, a better understanding of their practical application effects can be achieved.
5. **Construction of a Comprehensive Evaluation Platform**: Finally, the paper emphasizes the importance of building a comprehensive evaluation platform to cover the evaluation of LLMs' capabilities, alignment, and safety. This helps guide the responsible development of LLMs, maximizing societal benefits while minimizing potential risks.

In summary, the goal of this paper is to provide a panoramic view for the research and development of LLMs through comprehensive evaluation methods and benchmarks, promoting further research interest in this field.

Evaluating Large Language Models: A Comprehensive Survey

A Survey on Evaluation of Large Language Models

Exploring Advanced Methodologies in Security Evaluation for LLMs

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Evaluating large language models in medical applications: a survey

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry

A Survey on Evaluation of Multimodal Large Language Models

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

Leveraging Large Language Models for NLG Evaluation: Advances and Challenges

A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction

Multilingual Large Language Models: A Systematic Survey

A Comprehensive Overview of Large Language Models

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas: A Survey

A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation

Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods

A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law