Abstract: Language models (LMs) like GPT‐3, PaLM, and ChatGPT are the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. While LMs are increasingly salient, transparency lags behind: models from Google, Microsoft, Meta, OpenAI, and others had not been evaluated in the same way, which prevented clear comparison. We present Holistic Evaluation of Language Models (HELM), a new benchmarking approach that improves the transparency of LMs through standardized evaluation. LMs can serve many purposes, and their behavior should satisfy many desiderata. To navigate the vast space of potential scenarios and metrics, we taxonomize the space and select representative subsets. We evaluate models on 16 core scenarios and 7 metrics, exposing important trade‐offs. We supplement our core evaluation with 7 targeted evaluations to deeply analyze specific aspects (including world knowledge, reasoning, regurgitation of copyrighted content, and generation of disinformation). We benchmark 30 LMs from OpenAI, Microsoft, Google, Meta, Cohere, AI21 Labs, and others. Prior to HELM, models were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: all 30 models are now benchmarked under the same standardized conditions. Our evaluation surfaces 25 top‐level findings. For full transparency, we release all raw model prompts and completions publicly. HELM is a living benchmark for the community, continuously updated with new scenarios, metrics, and models: https://crfm.stanford.edu/helm/latest/.

METAL: Towards Multilingual Meta-Evaluation

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios

METAL: Metamorphic Testing Framework for Analyzing Large-Language Model Qualities

PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

Towards Multilingual LLM Evaluation for European Languages

Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

Enabling Weak LLMs to Judge Response Reliability via Meta Ranking

Holistic Evaluation of Language Models

Large Language Model Evaluation Via Multi AI Agents: Preliminary results

A Comprehensive Analysis of the Effectiveness of Large Language Models As Automatic Dialogue Evaluators

Automatic Large Language Model Evaluation Via Peer Review

Style Over Substance: Evaluation Biases for Large Language Models

Evaluating Language Models for Generating and Judging Programming Feedback

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks