Dynamic Evaluation of Large Language Models by Meta Probing Agents

Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, Xing Xie

2024-06-07

Abstract: Evaluation of large language models (LLMs) has raised great concerns in the community due to the issue of data contamination. Existing work designed evaluation protocols using well-defined algorithms for specific tasks, which cannot be easily extended to diverse scenarios. Moreover, current evaluation benchmarks can only provide the overall benchmark results and cannot support a fine-grained and multifaceted analysis of LLMs' abilities. In this paper, we propose meta probing agents (MPA), a general dynamic evaluation protocol inspired by psychometrics to evaluate LLMs. MPA is the key component of DyVal 2, which naturally extends the previous DyVal (Zhu et al., 2023). MPA designs the probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory on three basic cognitive abilities: language understanding, problem solving, and domain knowledge. These basic abilities are also dynamically configurable, allowing multifaceted analysis. We conducted extensive evaluations using MPA and found that most LLMs achieve poorer performance, indicating room for improvement. Our multifaceted analysis demonstrated the strong correlation between the basic abilities and an implicit Matthew effect on model size, i.e., larger models possess stronger correlations of the abilities. MPA can also be used as a data augmentation approach to enhance LLMs. Code is available at: https://github.com/microsoft/promptbench.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses the evaluation of large language models (LLMs). Data contamination has raised concerns about how far benchmark scores can be trusted. Existing dynamic evaluation protocols rely on well-defined algorithms for specific tasks and do not extend easily to diverse scenarios, and current benchmarks report only overall results, without fine-grained, multifaceted analysis. To address this, the paper proposes Meta Probing Agents (MPA), a dynamic evaluation protocol inspired by psychometrics. MPA uses probing and judging agents to automatically transform an original evaluation problem into a new one, guided by three basic cognitive abilities: language understanding, problem solving, and domain knowledge. These abilities are dynamically configurable, which allows multifaceted analysis. The paper reports that most LLMs perform worse under MPA evaluation, indicating room for improvement. The multifaceted analysis found strong correlations among the three basic abilities and an implicit Matthew effect with respect to model size: larger models show stronger correlations among the abilities. MPA can also serve as a data augmentation method to improve LLMs. The study evaluated multiple popular LLMs, including GPT-4-Turbo and GPT-3.5-Turbo, on multiple benchmark datasets. The results showed a significant performance drop on the dynamic benchmarks, suggesting potential data contamination in the current benchmarks.
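The probing-and-judging idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable type, the function `probe_and_judge`, the retry limit, the prompt wording, and the instruction text attached to each ability are all assumptions made for the example, and the stub agents stand in for real model calls.

```python
from typing import Callable, Optional

# Hypothetical type: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]

# Illustrative instructions per ability (assumed wording, not taken from the paper).
PRINCIPLES = {
    "language understanding": "Paraphrase the question without changing its meaning.",
    "problem solving": "Reorder the answer choices; keep their content unchanged.",
    "domain knowledge": "Add one new plausible but incorrect answer choice.",
}


def probe_and_judge(question: str, principle: str, prober: LLM, judge: LLM,
                    max_tries: int = 3) -> Optional[str]:
    """Return a transformed question that the judge accepts, or None."""
    instruction = PRINCIPLES[principle]
    for _ in range(max_tries):
        # Probing agent: rewrite the original problem under one principle.
        candidate = prober(f"{instruction}\n\nQuestion:\n{question}")
        # Judging agent: check that the rewrite preserves the correct answer.
        verdict = judge(
            "Does the new question keep the same correct answer as the original? "
            "Reply Yes or No.\n\n"
            f"Original:\n{question}\n\nNew:\n{candidate}"
        )
        if verdict.strip().lower().startswith("yes"):
            return candidate
    return None


# Stub agents so the sketch runs without any model access.
def echo_prober(prompt: str) -> str:
    return "REWRITTEN: " + prompt.split("Question:\n", 1)[1]


def always_yes_judge(prompt: str) -> str:
    return "Yes"


new_question = probe_and_judge(
    "What is 2 + 2?", "language understanding", echo_prober, always_yes_judge
)
```

In practice `prober` and `judge` would wrap calls to an actual model. Keeping the judge separate from the prober means a rewrite is discarded when the check fails rather than silently entering the new benchmark.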

DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks

DyVal: Graph-informed Dynamic Evaluation of Large Language Models

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

Evaluating Large Language Models at Evaluating Instruction Following

Large Language Model Evaluation Via Multi AI Agents: Preliminary results

PsyEval: A Suite of Mental Health Related Tasks for Evaluating Large Language Models

A Survey on Evaluation of Large Language Models

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

What is the best model? Application-driven Evaluation for Large Language Models

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Evaluating Large Language Models: A Comprehensive Survey

Law of the Weakest Link: Cross Capabilities of Large Language Models