Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

Aidar Myrzakhan, Sondos Mahmoud Bsharat, Zhiqiang Shen
2024-06-12
Abstract: Multiple-choice questions (MCQ) are frequently used to assess large language models (LLMs). Typically, an LLM is given a question and selects the answer deemed most probable after adjustments for factors like length. Unfortunately, LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, because of a priori unbalanced probabilities, which influences the prediction of answers based on these IDs. Previous research has introduced methods to reduce this "selection bias" by simply permuting options on a few test samples and applying the result to new ones. Another problem with MCQ is the lottery-ticket effect of "random guessing": the LLM has not learned the relevant knowledge, yet it guesses the correct option. This situation is especially serious for small-scale LLMs. To address these issues, a more thorough approach involves shifting from MCQ to open-style questions, which can fundamentally eliminate selection bias and random guessing. However, the transition brings its own set of challenges in (1) identifying suitable open-style questions and (2) validating the correctness of LLM open-style responses against human-annotated ground truths. This work aims to tackle these difficulties and establish a new LLM evaluation benchmark built entirely on open-style questions. Consequently, we introduce the Open-LLM-Leaderboard to track the performance of various LLMs, such as GPT-4o/4/3.5, Claude 3, Gemini, etc., and reflect their true capabilities. Our code and dataset are available at https://github.com/VILA-Lab/Open-LLM-Leaderboard.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses two problems in the evaluation of large language models (LLMs): selection bias and random guessing. Both arise from the widely used multiple-choice question (MCQ) format. First, models tend to favor certain option IDs (such as A/B/C/D) regardless of what the options say, which is known as selection bias. Second, a model, especially a small-scale one, may reach the correct answer by random guessing, so its score does not reflect what it actually knows. To solve these problems, the paper proposes a new evaluation benchmark made up entirely of open-style questions and constructs the Open-LLM-Leaderboard to track the performance of different LLMs. Because open-style questions present no options to choose from, they can fundamentally eliminate selection bias and random guessing while better evaluating a model's generative and comprehension abilities. The paper also designs an automatic screening and verification process to identify which questions are suitable for the open-style format and to check open-style responses against the human-annotated ground truths.
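The two failure modes described above can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the paper's code: the function names `random_guess_accuracy` and `option_id_distribution` are hypothetical, and the simulated "model" is simply a uniform random guesser. It shows that a model with no knowledge still scores about 1/num_options on MCQ, and how one might tabulate which option IDs a model picks in order to spot selection bias.

```python
import random
from collections import Counter


def random_guess_accuracy(num_questions, num_options=4, seed=0):
    """Accuracy of a hypothetical model that picks an option ID uniformly at random."""
    rng = random.Random(seed)
    option_ids = [chr(ord("A") + i) for i in range(num_options)]
    correct = 0
    for _ in range(num_questions):
        gold = rng.choice(option_ids)   # position of the correct answer
        guess = rng.choice(option_ids)  # the model's uninformed pick
        correct += guess == gold
    return correct / num_questions


def option_id_distribution(predictions):
    """Fraction of predictions landing on each option ID.

    With correct answers spread evenly across positions, a heavily skewed
    distribution hints at selection bias.
    """
    counts = Counter(predictions)
    total = len(predictions)
    return {opt: counts[opt] / total for opt in sorted(counts)}


if __name__ == "__main__":
    # A model that knows nothing still lands near 25% on 4-option MCQ.
    print(random_guess_accuracy(100_000))
    # Toy predictions (made up for illustration) skewed toward "A".
    print(option_id_distribution(["A", "A", "B", "C"]))
```

An open-style question has no option IDs to pick from, so neither the chance-level floor nor the ID preference applies; the remaining difficulty, as the abstract notes, is judging a free-form response against the ground truth.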