Abstract: Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information and reason over it to solve complex problems. Given this capability, there is increasing interest in employing LLM agents to predict international events, which can influence decision-making and shape policy development on an international scale. Despite this growing interest, there is no rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities in three dimensions: 1) autonomously sourcing and integrating critical information from large global databases; 2) writing code using domain-specific APIs and libraries for tool use; and 3) jointly reasoning over historical knowledge from diverse formats and times to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relations analysis.

What problem does this paper attempt to address?

The problem this paper addresses is the lack of a standardized benchmark for evaluating how well large language model (LLM) agents predict international events. Although LLM agents can autonomously collect world information and reason over it to solve complex problems, there is currently no systematic framework for measuring their reliability and accuracy when forecasting international events. This raises concerns about the reliability of AI forecasters, especially in high-risk scenarios. To fill this gap, the paper introduces MIRAI (Multi-Information FoRecasting Agent Interface), a new benchmark environment designed to rigorously evaluate LLM agents as temporal forecasters of international events. MIRAI provides an agentic environment in which LLM agents interact with a relational event database and a database of news text, so that they can collect, process, and apply information autonomously. Specifically, MIRAI evaluates three capabilities:

1. **Multi-source information integration**: Agents autonomously obtain and integrate key information from large global databases.
2. **Code-writing ability**: Agents write code using domain-specific APIs and libraries to call different tools.
3. **Joint reasoning over historical knowledge**: Agents combine historical knowledge from different formats and time periods to predict future events accurately.

Through comprehensive benchmarking, MIRAI aims to establish a reliable framework for assessing the capabilities of LLM agents in international event prediction, thereby promoting the development of more accurate and reliable models and improving the understanding of global dynamics.
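To make the idea of a relational prediction task with a forecasting horizon more concrete, here is a minimal Python sketch. It is not MIRAI's actual API: the `Event` record, the `get_events` tool, the frequency baseline in `predict_relations`, the precision/recall scoring, and the country and relation codes are all hypothetical, chosen only for illustration. The sketch assumes that a forecasting horizon of N days means the forecaster may only see events dated at least N days before the query date.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Event:
    """One structured event: subject country does `relation` to object country."""
    day: date
    subject: str   # placeholder country code (hypothetical)
    relation: str  # placeholder relation code (hypothetical)
    obj: str


def get_events(events, subject, obj, before):
    """Hypothetical tool: events for a country pair dated strictly before `before`."""
    return [e for e in events
            if e.subject == subject and e.obj == obj and e.day < before]


def predict_relations(events, subject, obj, query_day, horizon_days, top_k=2):
    """Naive baseline: return the most frequent past relations for the pair.

    Only events older than `horizon_days` before the query date are visible,
    so a longer horizon leaves less history to work with.
    """
    cutoff = query_day - timedelta(days=horizon_days)
    history = get_events(events, subject, obj, before=cutoff)
    counts = Counter(e.relation for e in history)
    return [rel for rel, _ in counts.most_common(top_k)]


def precision_recall(predicted, actual):
    """Set-based precision and recall of predicted relation codes."""
    pred, gold = set(predicted), set(actual)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy history for the pair AAA -> BBB, plus one event in the reverse direction.
EVENTS = [
    Event(date(2023, 1, 5), "AAA", "042", "BBB"),
    Event(date(2023, 1, 9), "AAA", "042", "BBB"),
    Event(date(2023, 1, 20), "AAA", "036", "BBB"),
    Event(date(2023, 1, 25), "AAA", "042", "BBB"),
    Event(date(2023, 1, 28), "AAA", "036", "BBB"),
    Event(date(2023, 2, 1), "AAA", "190", "BBB"),
    Event(date(2023, 2, 1), "BBB", "042", "AAA"),
]

if __name__ == "__main__":
    query_day = date(2023, 2, 10)
    for horizon in (7, 30):
        preds = predict_relations(EVENTS, "AAA", "BBB", query_day, horizon)
        print(horizon, preds, precision_recall(preds, ["042", "190"]))
```

In this toy setup the 7-day horizon sees six weeks of history while the 30-day horizon sees only the first two events, which shows why longer horizons are harder. An LLM agent in a benchmark of this kind would replace the frequency baseline with its own reasoning over the retrieved events and news text.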

MIRAI: Evaluating LLM Agents for Event Forecasting

A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting

From News to Forecast: Integrating Event Analysis in LLM-Based Time Series Forecasting with Reflection

AgentBench: Evaluating LLMs as Agents

LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments

MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

Approaching Human-Level Forecasting with Language Models

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

Can Language Models Use Forecasting Strategies?

An LLM Agent for Automatic Geospatial Data Analysis

MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents

MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

MindAgent: Emergent Gaming Interaction

Evaluating Cultural and Social Awareness of LLM Web Agents

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents