Efficient Benchmarking of Language Models

Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Liat Ein-Dor, Eyal Shnarch, Noam Slonim, Michal Shmueli-Scheuer, Leshem Choshen
2024-04-02
Abstract: The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, extending to thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts has raised little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off. We propose to evaluate the reliability of such decisions using a new measure, Decision Impact on Reliability (DIoR for short). We find, for example, that a benchmark leader may change by merely removing a low-ranked model from the benchmark, and observe that a correct benchmark ranking can be obtained by considering only a fraction of the evaluation examples. Based on our findings, we outline a set of concrete recommendations for efficient benchmark design and utilization practices. To take a step further, we use our findings to propose an evaluation algorithm that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by x100 or more.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is **how to intelligently reduce the computational cost of language model (LM) evaluation without compromising reliability**. As language models have become more versatile, researchers have built new benchmarks that assess a broad range of capabilities. These benchmarks carry heavy computational costs, reaching thousands of GPU hours per model, yet the efficiency of such evaluations has received little attention in the literature. The paper introduces the notion of "efficient benchmarking": reducing computation without affecting the reliability of evaluation results. Using the HELM benchmark as a case study, the authors examine how different benchmark design choices affect the trade-off between computation and reliability. They also propose a new metric, Decision Impact on Reliability (DIoR), to quantify how a given design decision affects benchmark reliability.

### Main contributions:

1. **Highlighted the balance between computation and reliability** and proposed DIoR as a quantitative metric for measuring the reliability of specific efficiency strategies.
2. **Conducted the first systematic study** of how benchmark design decisions affect reliability.
3. **Provided an efficient benchmark construction checklist** that guides how to reduce computational costs while maintaining sufficient evaluation reliability.
4. **Proposed a dynamic ranking algorithm** that significantly reduces the amount of computation with minimal impact on the original ranking.

### Core problems and solutions:

- **High computational cost**: Cutting unnecessary computation (for example, evaluating fewer examples or optimizing prompt selection) can reduce cost significantly while keeping the evaluation results reliable.
- **Reliability issues**: The DIoR metric is used to check that a given cost reduction does not significantly affect the reliability of the evaluation results. For example, the choice of scenarios or sub-scenarios is found to have a larger impact on the results, while the choice of specific examples has a smaller impact.

### Conclusion:

This research provides both theoretical support and practical guidelines for future benchmark design. It also shows how existing benchmarks such as HELM can be made more efficient to run, supporting further progress in language model evaluation.
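
The observation that a fraction of the evaluation examples can still yield the correct ranking can be probed with a small experiment. The sketch below is a minimal illustration, not the paper's DIoR definition (which this summary does not spell out): it repeatedly subsamples examples, re-ranks the models by mean score, and reports the average Kendall's tau against the full-data ranking. The function names, the use of Kendall's tau, and the mean-score aggregation are all assumptions made for illustration.

```python
import random
from itertools import combinations


def rank_models(scores):
    """Order model names from best to worst by mean per-example score.

    `scores` maps a model name to a list of per-example scores.
    Ties are broken alphabetically so the ordering is deterministic.
    """
    means = {model: sum(vals) / len(vals) for model, vals in scores.items()}
    return sorted(means, key=lambda model: (-means[model], model))


def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items (no ties)."""
    if len(order_a) < 2:
        raise ValueError("need at least two items to compare rankings")
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # A pair is concordant when both orderings agree on which item is first.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def subsample_stability(scores, n_examples, n_trials=200, seed=0):
    """Average agreement between subsampled rankings and the full ranking.

    Hypothetical reliability proxy (an assumption, not DIoR itself):
    draw `n_examples` example indices without replacement, re-rank the
    models on that subset, and average Kendall's tau over `n_trials` draws.
    """
    rng = random.Random(seed)
    full_order = rank_models(scores)
    total = len(next(iter(scores.values())))
    taus = []
    for _ in range(n_trials):
        idx = rng.sample(range(total), n_examples)
        subset = {model: [vals[i] for i in idx] for model, vals in scores.items()}
        taus.append(kendall_tau(full_order, rank_models(subset)))
    return sum(taus) / len(taus)
```

Calling `subsample_stability(scores, n_examples)` for decreasing values of `n_examples` gives a rough picture of how far the example count can be cut before the ranking starts to drift. A value near 1 means the subsampled rankings almost always match the full ranking.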