Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

Aidar Myrzakhan, Sondos Mahmoud Bsharat, Zhiqiang Shen

2024-06-12

Abstract: Multiple-choice questions (MCQs) are frequently used to assess large language models (LLMs). Typically, an LLM is given a question and selects the answer deemed most probable after adjustments for factors like length. Unfortunately, LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, because of unbalanced a priori probabilities, which influences answer prediction based on these IDs. Previous research has introduced methods to reduce this "selection bias" by simply permuting options on a few test samples and applying the result to new ones. Another problem with MCQs is the lottery-ticket effect of "random guessing": the LLM has not learned the relevant knowledge, yet it guesses the correct option. This situation is especially serious for small-scale LLMs. To address both problems, a more thorough approach is to shift from MCQs to open-style questions, which can fundamentally eliminate selection bias and random guessing. However, this transition brings its own challenges in (1) identifying suitable open-style questions and (2) validating the correctness of LLM open-style responses against human-annotated ground truths. This work aims to tackle these difficulties and establish a new LLM evaluation benchmark built entirely from open-style questions. Consequently, we introduce the Open-LLM-Leaderboard to track the performance of various LLMs, such as GPT-4o/4/3.5, Claude 3, and Gemini, and to reflect their true capability. Our code and dataset are available at https://github.com/VILA-Lab/Open-LLM-Leaderboard.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses selection bias and random guessing in the evaluation of large language models (LLMs). The common multiple-choice question (MCQ) format has two problems. First, models tend to favor particular option IDs (such as A, B, C, or D), which is known as selection bias. Second, models, especially small-scale ones, may reach the correct answer through random guessing, so the score does not reflect the model's true capability. To solve these problems, the paper proposes a benchmark built entirely from open-style questions and constructs the Open-LLM-Leaderboard to track the performance of different LLMs. Open-style questions can fundamentally eliminate selection bias and random guessing while better evaluating a model's generative and comprehension abilities. The paper also designs an automatic screening and verification process, which identifies questions suitable for the open-style format and checks open-style responses against human-annotated ground truths.
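The two MCQ problems named above can be made concrete with a small simulation. The following is a minimal sketch and is not taken from the paper: the helper names `random_guess_accuracy` and `option_id_skew`, the four-option setup, and the synthetic answer keys are all assumptions chosen for illustration. It shows that a guesser with no knowledge still scores about 1/4 on four-option questions, and that comparing how often each option ID is predicted with how often it is actually correct gives a rough indication of a preference for particular IDs.

```python
import random
from collections import Counter


def random_guess_accuracy(answer_keys, option_ids, seed=0):
    """Accuracy of a 'model' that picks an option ID uniformly at random.

    Hypothetical helper: with k options the expected accuracy is 1/k,
    which is the floor that MCQ scores cannot distinguish from knowledge.
    """
    rng = random.Random(seed)
    hits = sum(rng.choice(option_ids) == key for key in answer_keys)
    return hits / len(answer_keys)


def option_id_skew(predictions, answer_keys, option_ids):
    """Per-ID gap between prediction frequency and gold-answer frequency.

    Hypothetical helper: a large positive value for an ID means it is
    chosen more often than it is correct, a crude sign of selection bias.
    """
    n = len(predictions)
    predicted = Counter(predictions)
    gold = Counter(answer_keys)
    return {o: predicted[o] / n - gold[o] / n for o in option_ids}


if __name__ == "__main__":
    option_ids = ["A", "B", "C", "D"]

    # Synthetic answer keys, roughly balanced across the four IDs (assumption).
    rng = random.Random(1)
    answer_keys = [rng.choice(option_ids) for _ in range(10_000)]

    # A guesser with no knowledge lands near 0.25 on four-option MCQs.
    print("random-guess accuracy:", random_guess_accuracy(answer_keys, option_ids))

    # An extreme 'model' that always answers A shows a large skew toward A.
    always_a = ["A"] * len(answer_keys)
    print("skew for always-A model:", option_id_skew(always_a, answer_keys, option_ids))
```

An open-style question has no option IDs to prefer and no small answer set to guess from, which is why neither quantity is meaningful there.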

Leveraging Large Language Models for Multiple Choice Question Answering

UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Can multiple-choice questions really be useful in detecting the abilities of LLMs?

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Peer-review-in-LLMs: Automatic Evaluation Method for LLMs in Open-environment

GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?

LLMs May Perform MCQA by Selecting the Least Incorrect Option

See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses

Multiple-Choice Questions are Efficient and Robust LLM Evaluators

Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard

CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions

A Study on Large Language Models' Limitations in Multiple-Choice Question Answering

Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments