Abstract: Large language models (LLMs) match and sometimes exceed human performance in many domains. This study explores the potential of LLMs to augment human judgement in a forecasting task. We evaluate the effect on human forecasters of two LLM assistants: one designed to provide high-quality ("superforecasting") advice, and the other designed to be overconfident and base-rate neglecting, thus providing noisy forecasting advice. We compare participants using these assistants to a control group that received a less advanced model that did not provide numerical predictions or engage in explicit discussion of predictions. Participants (N = 991) answered a set of six forecasting questions and had the option to consult their assigned LLM assistant throughout. Our preregistered analyses show that interacting with each of our frontier LLM assistants significantly enhances prediction accuracy by between 24 percent and 28 percent compared to the control group. Exploratory analyses showed a pronounced outlier effect in one forecasting item, without which we find that the superforecasting assistant increased accuracy by 41 percent, compared with 29 percent for the noisy assistant. We further examine whether LLM forecasting augmentation disproportionately benefits less skilled forecasters, degrades the wisdom-of-the-crowd by reducing prediction diversity, or varies in effectiveness with question difficulty. Our data do not consistently support these hypotheses. Our results suggest that access to a frontier LLM assistant, even a noisy one, can be a helpful decision aid in cognitively demanding tasks compared to a less powerful model that does not provide specific forecasting advice. However, the effects of outliers suggest that further research into the robustness of this pattern is needed.

Humans vs. large language models: Judgmental forecasting in an era of advanced AI

Can Language Models Use Forecasting Strategies?

AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy

Approaching Human-Level Forecasting with Language Models

Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament

Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models

Large Language Models: Their Success and Impact

Macroeconomic Forecasting with Large Language Models

Large Language Models Assume People are More Rational than We Really are

Human–Artificial Intelligence Collaboration in Prediction: A Field Experiment in the Retail Industry

A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting

Large Language Models in Consumer Electronic Retail Industry: An AI Product Advisor

Can large language models help predict results from a complex behavioural science study?

Large language models can outperform humans in social situational judgments

Reasoning and Tools for Human-Level Forecasting

The Promise and Peril of Generative AI: Evidence from GPT-4 as Sell-Side Analysts

LLMForecaster: Improving Seasonal Event Forecasts with Unstructured Textual Data

Large Language Models for Forecasting and Anomaly Detection: A Systematic Literature Review