Abstract: The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), which are gradually showing cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) that were once exclusive to human intellect. To comprehensively evaluate current models' cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These problems span seven fields and 62 international Olympic competitions, and have been rigorously examined for data leakage. We argue that Olympic competition problems are ideal for evaluating AI's cognitive reasoning because of their complexity and interdisciplinary nature, both of which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We examine the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o achieve only 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration. Through OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.

AI-Olympics: Exploring the Generalization of Agents through Open Competitions

OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

Official International Mahjong: A New Playground for AI Research

ALYMPICS: LLM Agents Meet Game Theory -- Exploring Strategic Decision-Making with AI Agents

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Mahjong AI Competition: Exploring AI Application in Complex Real-World Games

CompeteAI: Understanding the Competition Dynamics in Large Language Model-based Agents

Benchmarking Robustness and Generalization in Multi-Agent Systems: A Case Study on Neural MMO

CompeteAI: Understanding the Competition Dynamics of Large Language Model-based Agents

CompeteAI: Understanding the Competition Behaviors in Large Language Model-based Agents

AI in Human-computer Gaming: Techniques, Challenges and Opportunities

The Overcooked Generalisation Challenge

Botzone: an Online Multi-agent Competitive Platform for AI Education

The AI Driving Olympics at NeurIPS 2018

A research of artificial intelligence game agent application

Evaluation of OpenAI o1: Opportunities and Challenges of AGI

AI Olympics challenge with Evolutionary Soft Actor Critic

Agents: An Open-source Framework for Autonomous Language Agents

The Application and Development of Artificial Intelligence and High Technology in Sports Event

Multi-Agent, Human-Agent and Beyond: A Survey on Cooperation in Social Dilemmas