Abstract: AI models are increasingly prevalent in high-stakes environments, necessitating thorough assessment of their capabilities and risks. Benchmarks are popular for measuring these attributes and for comparing model performance, tracking progress, and identifying weaknesses in foundation and non-foundation models. They can inform model selection for downstream tasks and influence policy initiatives. However, not all benchmarks are the same: their quality depends on their design and usability. In this paper, we develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it. We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues. We further find that most benchmarks do not report statistical significance of their results nor allow for their results to be easily replicated. To support benchmark developers in aligning with best practices, we provide a checklist for minimum quality assurance based on our assessment. We also develop a living repository of benchmark assessments to support benchmark comparability, accessible at http://betterbench.stanford.edu.


### What problems does this paper attempt to solve?

The paper "BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices" addresses how to assess and improve the quality of AI benchmarks. Specifically, the authors focus on the following aspects:

1. **Evaluating the quality of AI benchmarks**:
   - AI models are increasingly common in high-stakes environments, so their capabilities and risks need to be thoroughly assessed.
   - Benchmarks are important tools for measuring these attributes, comparing model performance, tracking progress, and identifying weaknesses in both foundation and non-foundation models.
   - However, benchmark quality varies greatly, and many commonly used benchmarks have significant problems. Most do not report the statistical significance of their results or allow those results to be easily replicated.
2. **Establishing a best-practice framework**:
   - The authors developed an assessment framework that covers the full lifecycle of an AI benchmark and is based on 46 best-practice criteria.
   - Evaluating 24 AI benchmarks against this framework, they found large quality differences among them.
3. **Providing improvement suggestions and support tools**:
   - For benchmark developers, a minimum quality assurance checklist is provided to help them follow best practices.
   - A living repository of benchmark assessments (betterbench.stanford.edu) was created to support benchmark comparability.

### Main contributions of the paper

- **Proposing a new evaluation framework**: Based on expert interviews and domain literature, an AI benchmark assessment framework covering 46 criteria was proposed.
- **Scoring and analysis**: Scores were given to 16 foundation model (FM) and 8 non-foundation model (non-FM) benchmarks, revealing the quality differences between the two types of benchmarks.
- **Providing insights**: Based on the evaluation results, common problems in current AI benchmark practices were pointed out.
- **Formulating a best-practice checklist**: A minimum quality assurance checklist was provided for benchmark developers to guide their improvement work.
- **Releasing a living repository**: An online repository of benchmark assessments was created so that the quality of benchmarks can be compared and the assessments kept up to date.

Through these efforts, the paper hopes to promote the development of higher-quality AI benchmarks and to support research and applications that rely on them.
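One of the issues raised above is that most benchmarks do not report the statistical significance of their results. As a minimal illustration of what such reporting can look like, the sketch below computes a percentile bootstrap confidence interval for a model's accuracy on a benchmark. It is not the paper's method: the function name `bootstrap_ci`, its parameters, and the per-item scores are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def bootstrap_ci(per_item_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean benchmark score.

    Hypothetical helper for illustration; not taken from the paper.
    Returns (mean, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(per_item_scores)
    means = []
    for _ in range(n_resamples):
        # Resample benchmark items with replacement and record the mean.
        sample = [per_item_scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(per_item_scores), lower, upper


# Made-up per-item correctness (1 = correct, 0 = incorrect) for one model.
scores = [1] * 70 + [0] * 30
mean, lo, hi = bootstrap_ci(scores)
print(f"accuracy = {mean:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside the point estimate makes it easier to judge whether a gap between two models on the same benchmark is larger than the sampling noise. Fixing the random seed is one simple way to keep the reported numbers replicable.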

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

AIBench: An Industry Standard AI Benchmark Suite from Internet Services

AIBench: An Industry Standard AI Benchmark Suite

AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite

AIBench: Towards Scalable and Comprehensive Datacenter AI Benchmarking

More than Marketing? On the Information Value of AI Benchmarks for Practitioners

AIBench Scenario: Scenario-Distilling AI Benchmarking

AIBench Training: Balanced Industry-Standard AI Training Benchmarking

Mapping global dynamics of benchmark creation and saturation in artificial intelligence

AI Agents That Matter

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

SAIBench: Benchmarking AI for Science

Benchmark Data Repositories for Better Benchmarking

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

AIBench: An Industry Standard Internet Service AI Benchmark Suite

AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies

Benchmarks as Microscopes: A Call for Model Metrology

Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench

Benchmarks for Automated Commonsense Reasoning: A Survey

Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals