Abstract: Evaluation is critical for assessing capabilities, tracking scientific progress, and informing model selection. In this paper, we present three desiderata for a good benchmark for language models: (i) salience (e.g., knowledge about World War II is more salient than knowledge about a random day in history), (ii) novelty (i.e., the benchmark reveals new trends in model rankings not shown by previous benchmarks), and (iii) difficulty (i.e., the benchmark should be difficult for existing models, leaving headroom for future improvement). We operationalize these three desiderata and cast benchmark creation as a search problem: finding benchmarks that satisfy all three. To tackle this search problem, we present AutoBencher, which uses a language model to automatically search for datasets that meet the three desiderata. AutoBencher uses privileged information (e.g., relevant documents) to construct reliable datasets, and adaptivity with reranking to optimize the search objective. We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that are on average 27% more novel and 22% more difficult than existing benchmarks. A closer investigation of our constructed datasets shows that we can identify specific gaps in language model knowledge that are not captured by existing benchmarks, such as Gemini Pro performing much worse on question answering about the Permian Extinction and Fordism, while OpenAGI-7B performs surprisingly well on QA about COVID-19.
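
To make the search framing concrete, the following is a minimal sketch of how candidate datasets might be scored and reranked against the three desiderata. It is an illustration under stated assumptions, not the AutoBencher implementation: the abstract does not give the metric definitions, so difficulty is assumed here to be one minus the best model's accuracy, novelty is assumed to be one minus the Spearman rank correlation between model accuracies on the candidate and on prior benchmarks, and salience is assumed to be a precomputed score in [0, 1] used as a threshold. All names (`Candidate`, `score`, `rerank`, `min_salience`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    topic: str
    salience: float                # assumed precomputed, in [0, 1]
    accuracies: Dict[str, float]   # model name -> accuracy on this candidate


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs: List[float], ys: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector carries no ranking information
    return cov / (vx * vy) ** 0.5


def difficulty(c: Candidate) -> float:
    # Assumption: headroom left by the strongest model.
    return 1.0 - max(c.accuracies.values())


def novelty(c: Candidate, prior: Dict[str, float]) -> float:
    # Assumption: how much the model ranking departs from prior benchmarks.
    models = sorted(c.accuracies)
    new = [c.accuracies[m] for m in models]
    old = [prior[m] for m in models]
    return 1.0 - spearman(new, old)


def score(c: Candidate, prior: Dict[str, float], min_salience: float = 0.5) -> float:
    # Salience acts as a hard constraint; novelty and difficulty are summed.
    if c.salience < min_salience:
        return float("-inf")
    return novelty(c, prior) + difficulty(c)


def rerank(candidates: List[Candidate], prior: Dict[str, float]) -> List[Candidate]:
    """Order candidates from most to least promising under the assumed objective."""
    return sorted(candidates, key=lambda c: score(c, prior), reverse=True)
```

In this sketch, a candidate whose model ranking mirrors prior benchmarks receives a novelty of 0, one that fully reverses the ranking receives 2, and any candidate below the salience threshold is ranked last regardless of its other scores.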

Evaluating Models' Local Decision Boundaries Via Contrast Sets

Evaluating NLP Models Via Contrast Sets

Evaluating Large Language Models Using Contrast Sets: An Experimental Approach

Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA

Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark

Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions

Debiased Contrastive Learning of Unsupervised Sentence Representations

Self-Damaging Contrastive Learning

An Efficient Method of Supervised Contrastive Learning for Natural Language Understanding

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models

Benchmarking Cognitive Biases in Large Language Models as Evaluators

Contrastive Open Set Recognition

Disjoint Contrastive Regression Learning for Multi-Sourced Annotations

Extensive Self-Contrast Enables Feedback-Free Language Model Alignment

Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

Learning from Crowds with Contrastive Representation

Understanding Contrastive Learning via Distributionally Robust Optimization

Contrast Sets for Evaluating Language-Guided Robot Policies

Contrast Is All You Need