Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. In addition to a comprehensive review of the evaluation methodologies and benchmarks for these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability. We hope that this comprehensive overview will stimulate further research interest in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution in a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.

AutoLLM-CARD: Towards a Description and Landscape of Large Language Models

Automatic Generation of Model and Data Cards: A Step Towards Responsible AI

ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models

Large Language Models for Data Annotation: A Survey

Investigations on Scientific Literature Meta Information Extraction Using Large Language Models

An Interdisciplinary Outlook on Large Language Models for Scientific Research

LMDX: Language Model-based Document Information Extraction and Localization

Large Language Models for Scholarly Ontology Generation: An Extensive Analysis in the Engineering Field

LawLLM: Law Large Language Model for the US Legal System

Large Language Models for Generative Information Extraction: A Survey

GEIC: Universal and Multilingual Named Entity Recognition with Large Language Models

Transforming Scholarly Landscapes: Influence of Large Language Models on Academic Fields beyond Computer Science

On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models

Large Language Models on Graphs: A Comprehensive Survey

Evaluating Large Language Models: A Comprehensive Survey

Large Language Models in Computer Science Education: A Systematic Literature Review

LLMs in Biomedicine: A study on clinical Named Entity Recognition

Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models

Inspire the Large Language Model by External Knowledge on BioMedical Named Entity Recognition

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

Unlocking Model Insights: A Dataset for Automated Model Card Generation