Abstract: Large language models (LLMs) offer powerful capabilities but also introduce significant risks. One way to mitigate these risks is through comprehensive pre-deployment evaluations using benchmarks designed to test for specific vulnerabilities. However, the rapidly expanding body of LLM benchmark literature lacks a standardized method for documenting crucial benchmark details, hindering consistent use and informed selection. BenchmarkCards addresses this gap by providing a structured framework specifically for documenting LLM benchmark properties rather than defining the entire evaluation process itself. BenchmarkCards do not prescribe how to measure or interpret benchmark results (e.g., defining "correctness") but instead offer a standardized way to capture and report critical characteristics like targeted risks and evaluation methodologies, including properties such as bias and fairness. This structured metadata facilitates informed benchmark selection, enabling researchers to choose appropriate benchmarks and promoting transparency and reproducibility in LLM evaluation.

What problem does this paper attempt to address?

The paper addresses the lack of standardized documentation for the benchmarks used in pre-deployment evaluation of large language models (LLMs). Although LLMs demonstrate powerful capabilities in many areas, they also pose significant risks, such as generating biased, harmful, or misleading content, which can undermine public trust and enable malicious use. Assessing these risks requires benchmarks designed to test for specific vulnerabilities. However, the current literature has no standardized method for documenting the key details of these benchmarks, which makes it difficult to select benchmarks and to interpret their results.

The paper proposes a framework called BenchmarkCards to fill this gap with structured metadata. This metadata covers key information such as targeted risks, evaluation methods, and dataset characteristics. It promotes transparency and reproducibility, helps researchers choose appropriate benchmarks, and fosters a comprehensive understanding of LLM performance.

In summary, the main objectives of the paper are:

1. **Standardized Documentation**: Provide a standardized template for recording and reporting the key attributes of LLM benchmarks, including targeted risks and potential limitations.
2. **Promote Transparency and Reproducibility**: Help users better understand and interpret evaluation results by thoroughly documenting the data sources, evaluation methods, and potential risks of benchmarks.
3. **Support Decision-Making**: Assist researchers, developers, and policymakers in making more informed decisions when selecting and using benchmarks.

Meeting these objectives would improve the transparency and reliability of LLM evaluations, reduce potential risks, and support the healthy development of LLM technology.
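To make the idea of structured benchmark metadata concrete, here is a minimal Python sketch of what a card and a risk-based selection step might look like. It is not the paper's schema: the class `BenchmarkCard`, its field names, the helper `select_by_risk`, and both example benchmarks are hypothetical, chosen only to mirror the attributes named above (targeted risks, evaluation method, data sources, limitations).

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class BenchmarkCard:
    """Hypothetical card. Field names are illustrative, not the paper's template."""
    name: str
    targeted_risks: List[str]
    evaluation_method: str
    data_sources: List[str] = field(default_factory=list)
    limitations: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of fields left empty, so documentation gaps are visible."""
        return [key for key, value in asdict(self).items() if not value]


def select_by_risk(cards: List[BenchmarkCard], risk: str) -> List[BenchmarkCard]:
    """Keep only the cards that document the given risk."""
    return [card for card in cards if risk in card.targeted_risks]


# Both benchmarks below are made up for illustration.
cards = [
    BenchmarkCard(
        name="example-bias-benchmark",
        targeted_risks=["bias", "fairness"],
        evaluation_method="multiple-choice accuracy",
        data_sources=["crowdsourced prompts"],
        limitations=["English only"],
    ),
    BenchmarkCard(
        name="example-toxicity-benchmark",
        targeted_risks=["toxicity"],
        evaluation_method="classifier-scored generations",
    ),
]

for card in select_by_risk(cards, "bias"):
    print(card.name)

# Report which attributes the second card leaves undocumented.
print(cards[1].missing_fields())
```

Keeping the card as plain data means it only describes a benchmark. Consistent with the abstract, nothing in this sketch measures or interprets benchmark results.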

BenchmarkCards: Large Language Model and Risk Reporting

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

Benchmarking Benchmark Leakage in Large Language Models

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?

Risk Aware Benchmarking of Large Language Models

Assessing Language Model Deployment with Risk Cards

Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries

UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems

DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection

Are Large Language Models Memorizing Bug Benchmarks?

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

Benchmark Data Contamination of Large Language Models: A Survey

Don't Make Your LLM an Evaluation Benchmark Cheater

Enterprise Benchmarks for Large Language Model Evaluation

Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models

Benchmarking Cognitive Biases in Large Language Models as Evaluators

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models