From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation
Yan Zhuang, Qi Liu, Yuting Ning, Weizhe Huang, Zachary A. Pardos, Patrick C. Kyllonen, Jiyun Zu, Qingyang Mao, Rui Lv, Zhenya Huang, Guanhao Zhao, Zheng Zhang, Shijin Wang, Enhong Chen
2024-08-06
Abstract: As AI systems continue to grow, particularly generative models like Large Language Models (LLMs), their rigorous evaluation is crucial for development and deployment. To determine their adequacy, researchers have developed various large-scale benchmarks that evaluate models against a so-called gold-standard test set and report metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high computational costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this Perspective, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real time, tailoring the evaluation to the model's ongoing performance instead of relying on a fixed test set. This paradigm not only provides a more robust ability estimation but also significantly reduces the number of test items required. We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation. We propose that adaptive testing will become the new norm in AI model evaluation, enhancing both the efficiency and effectiveness of assessing advanced intelligence systems.
Computation and Language
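
The adaptive testing loop described in the abstract (estimate item characteristics, then pick the next item based on the model's ongoing performance) can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) item response model, maximum Fisher information item selection, and a grid-search ability estimate under a standard normal prior. These are common choices in computerized adaptive testing, not necessarily the ones the authors use, and every name here (`p_correct`, `fisher_information`, `estimate_theta`, `adaptive_test`, `answer_fn`) is hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL model: probability of a correct answer given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Information an item carries about theta under the 2PL model."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def estimate_theta(responses, grid):
    """Grid-search MAP estimate of ability with a standard normal prior.

    responses is a list of ((a, b), correct) pairs.
    """
    best_theta, best_score = grid[0], -math.inf
    for theta in grid:
        score = -0.5 * theta * theta  # log N(0, 1) prior, up to a constant
        for (a, b), correct in responses:
            p = p_correct(theta, a, b)
            score += math.log(p if correct else 1.0 - p)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta


def adaptive_test(items, answer_fn, n_items=20):
    """Run an adaptive test over items, a list of (a, b) parameter pairs.

    answer_fn(index) is an assumed callback that returns True when the
    model under evaluation answers the item at that index correctly.
    """
    grid = [x / 10.0 for x in range(-40, 41)]  # candidate abilities in [-4, 4]
    theta = 0.0  # start from the prior mean
    remaining = list(range(len(items)))
    responses = []
    for _ in range(min(n_items, len(items))):
        # Administer the unused item that is most informative at the current estimate.
        idx = max(remaining, key=lambda i: fisher_information(theta, *items[i]))
        remaining.remove(idx)
        responses.append((items[idx], answer_fn(idx)))
        # Re-estimate ability from every response collected so far.
        theta = estimate_theta(responses, grid)
    return theta, responses
```

In this sketch the item parameters are taken as given; in practice they would first be calibrated from the responses of many models. A call such as `adaptive_test(items, answer_fn, n_items=10)` administers only 10 items from the bank rather than the whole benchmark, which is the source of the efficiency gain the abstract refers to.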