Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. In addition to a comprehensive review of the evaluation methodologies and benchmarks for these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability. We hope that this comprehensive overview will stimulate further research interest in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.

A Comprehensive Analysis of the Effectiveness of Large Language Models As Automatic Dialogue Evaluators

Leveraging LLMs for Dialogue Quality Measurement

Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

Exploring the Dialogue Comprehension Ability of Large Language Models

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation

Evaluating Large Language Models in Analysing Classroom Dialogue

A Closer Look into Using Large Language Models for Automatic Evaluation

A Survey on Evaluation of Large Language Models

Leveraging Large Language Models for NLG Evaluation: Advances and Challenges

MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation

Evaluating Large Language Models: A Comprehensive Survey

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

Large Language Models Are Active Critics in NLG Evaluation

Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue

Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue