Abstract: In many settings, an effective way of evaluating objects of interest is to collect evaluations from dispersed individuals and aggregate them. Examples include categorizing online content and evaluating student assignments via peer grading. For this data science problem, one challenge is to motivate participants to conduct such evaluations carefully and to report them honestly, particularly when doing so is costly. Existing approaches, notably peer-prediction mechanisms, can incentivize truth telling in equilibrium. However, they also give rise to equilibria in which agents do not pay the costs required to evaluate accurately, and hence fail to elicit useful information. We show that this problem is unavoidable whenever agents are able to coordinate using low-cost signals about the items being evaluated (e.g., text labels or pictures). We then consider ways of circumventing this problem by comparing agents' reports to ground truth, which is available in practice when there exist trusted evaluators (such as teaching assistants in the peer grading scenario) who can perform a limited number of unbiased (but noisy) evaluations. Of course, when such ground truth is available, a simpler approach is also possible: rewarding each agent based on agreement with ground truth with some probability, and unconditionally rewarding the agent otherwise. Surprisingly, we show that the simpler mechanism achieves stronger incentive guarantees given less access to ground truth than a large set of peer-prediction mechanisms.
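
The simpler mechanism described in the abstract (check a report against ground truth with some probability, pay unconditionally otherwise) can be illustrated with a minimal Python sketch. This is not the paper's formal model: the function names, the binary agree/disagree payment, the fixed `reward`, and the effort model (accuracy `p_hi` with effort at cost `cost`, accuracy `p_lo` without) are all simplifying assumptions introduced here for illustration.

```python
import random


def spot_check_payment(report, ground_truth, check_prob, reward=1.0, rng=random):
    """Pay one agent under a hypothetical spot-checking rule.

    With probability `check_prob` the report is compared to ground truth and
    paid only on agreement; otherwise the agent is paid unconditionally.
    """
    if rng.random() < check_prob:
        return reward if report == ground_truth else 0.0
    return reward


def expected_payment(check_prob, accuracy, reward=1.0):
    """Expected payment for an agent whose report matches ground truth
    with probability `accuracy` (assumed model, not from the source)."""
    return reward * ((1.0 - check_prob) + check_prob * accuracy)


def min_check_prob(cost, p_hi, p_lo, reward=1.0):
    """Smallest check probability at which effort pays off in this toy model.

    Effort is worthwhile when
        expected_payment(q, p_hi) - cost >= expected_payment(q, p_lo),
    i.e. when q >= cost / (reward * (p_hi - p_lo)).
    """
    if p_hi <= p_lo:
        raise ValueError("effort must improve accuracy (p_hi > p_lo)")
    return cost / (reward * (p_hi - p_lo))


if __name__ == "__main__":
    q = min_check_prob(cost=0.1, p_hi=0.9, p_lo=0.5)
    print(f"effort is worthwhile once check_prob >= {q:.2f}")
```

In this toy model the required check probability grows with the cost of effort and shrinks as effort improves accuracy, which is one way to read the abstract's interest in how much access to ground truth a mechanism needs.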

What's a Good Prediction? Challenges in evaluating an agent's knowledge

The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate

Principled Knowledge Extrapolation with GANs

Estimating Knowledge in Large Language Models Without Generating a Single Token

Predictions as statements and decisions

Discovery of Useful Questions as Auxiliary Tasks

CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning

Conditioning Predictive Models: Risks and Strategies

Evaluating the World Model Implicit in a Generative Model

Do Pre-trained Models Benefit Knowledge Graph Completion? A Reliable Evaluation and a Reasonable Approach

Limitations of Agents Simulated by Predictive Models

Predicting Future Actions of Reinforcement Learning Agents

Evaluation Gaps in Machine Learning Practice

Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation

Predicting challenge moments from students' discourse: A comparison of GPT-4 to two traditional natural language processing approaches

Language Models (Mostly) Know What They Know

Artificial prediction markets present a novel opportunity for human-AI collaboration

Predicting vs. Acting: A Trade-off Between World Modeling & Agent Modeling

What executives need to know about knowledge management, large language models and generative AI

Towards Evaluating Generalist Agents: An Automated Benchmark in Open World

Incentivizing Evaluation via Limited Access to Ground Truth: Peer-Prediction Makes Things Worse