GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
2023-11-22
Abstract: We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper introduces GAIA (General AI Assistants), a new benchmark for general-purpose AI assistants. The authors argue that solving it would mark a significant milestone in AI research. GAIA consists of real-world questions that require an assistant to combine a range of fundamental capabilities, such as reasoning, multimodal processing, web browsing, and tool use.

The main contributions of the paper include:

1. **A benchmark of 466 carefully designed questions**: The questions are conceptually simple for humans but hard for state-of-the-art AI systems. Human participants answer 92% of them correctly, while GPT-4 equipped with plugins reaches only 15%.
2. **A different goal from current trends**: Current AI benchmarks tend to seek tasks that are increasingly difficult for humans. GAIA instead focuses on tasks that are simple for humans but require an AI system to carry out a complex sequence of operations.
3. **Questions that are easy to create yet challenging to solve**: The questions are easy to write but extremely challenging for AI systems. Each has a clear factual answer, which allows simple and robust automatic evaluation.
4. **Avoiding the limitations of existing benchmarks**: GAIA attempts to avoid some pitfalls in the evaluation of current large language models, such as data contamination and interpretability issues, by designing tasks that are conceptually simple but complex in execution.

GAIA aims to rethink how benchmarks are designed for new AI systems, especially general assistants that must operate in a diverse and uncertain world. The paper also gives methodological guidance for creating new questions, so that the community can extend the benchmark, and it analyzes the successes and shortcomings of some of the most advanced assistant systems.
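
The summary notes that each question has a clear factual answer, which allows simple automatic evaluation. A minimal Python sketch of what such a scorer could look like follows. It is an illustration under stated assumptions, not the benchmark's official scoring code: the function names (`normalize_text`, `parse_number`, `is_match`, `accuracy`), the normalization rules, and the sample task IDs are all hypothetical.

```python
import re
from typing import Dict, Optional


def normalize_text(text: str) -> str:
    """Lowercase, trim, collapse whitespace, and drop trailing periods."""
    collapsed = re.sub(r"\s+", " ", text.strip().lower())
    return collapsed.rstrip(".")


def parse_number(text: str) -> Optional[float]:
    """Return a float if the text looks like a number, else None.

    Assumption: thousands separators, '$', and '%' are formatting noise.
    """
    cleaned = text.strip().replace(",", "").replace("$", "").replace("%", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def is_match(prediction: str, reference: str) -> bool:
    """Compare a predicted answer with the reference answer.

    Numeric references are compared as numbers; everything else is
    compared as normalized text.
    """
    ref_num = parse_number(reference)
    if ref_num is not None:
        pred_num = parse_number(prediction)
        return pred_num is not None and pred_num == ref_num
    return normalize_text(prediction) == normalize_text(reference)


def accuracy(predictions: Dict[str, str], references: Dict[str, str]) -> float:
    """Fraction of reference questions answered correctly.

    Questions without a prediction count as wrong.
    """
    if not references:
        return 0.0
    correct = sum(
        is_match(predictions.get(task_id, ""), answer)
        for task_id, answer in references.items()
    )
    return correct / len(references)


if __name__ == "__main__":
    # Hypothetical task IDs and answers, for illustration only.
    refs = {"q1": "paris", "q2": "1000"}
    preds = {"q1": " Paris. ", "q2": "1,000"}
    print(accuracy(preds, refs))
```

Because answers are short facts rather than free-form text, a scorer of this kind needs no human grader or judge model, which is what makes leaderboard evaluation on held-out answers practical.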