Can multiple-choice questions really be useful in detecting the abilities of LLMs?

Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, Noa Garcia

2024-05-23

Abstract: Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs) due to their simplicity and efficiency. However, there are concerns about whether MCQs can truly measure LLMs' capabilities, particularly in knowledge-intensive scenarios where long-form generation (LFG) answers are required. The misalignment between the task and the evaluation method demands a thoughtful analysis of the efficacy of MCQs, which we undertake in this paper by evaluating nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English. We identify a significant issue: LLMs exhibit order sensitivity in bilingual MCQs, favoring answers located at specific positions, i.e., the first position. We further quantify the gap between MCQs and long-form generation questions (LFGQs) by comparing their direct outputs, token logits, and embeddings. Our results reveal a relatively low correlation between answers from MCQs and LFGQs for identical questions. Additionally, we propose two methods to quantify the consistency and confidence of LLMs' output, which can be generalized to other QA evaluation benchmarks. Notably, our analysis challenges the idea that the higher the consistency, the greater the accuracy. We also find MCQs to be less reliable than LFGQs in terms of expected calibration error. Finally, the misalignment between MCQs and LFGQs is reflected not only in the evaluation performance but also in the embedding space. Our code and models can be accessed at

Computation and Language

What problem does this paper attempt to address?

The paper examines how effective multiple-choice questions (MCQs) are for evaluating large language models (LLMs), and how MCQs differ from long-form generation questions (LFGQs). Specifically:

1. **Impact of Option Order on LLMs**: LLMs are highly sensitive to the position of options when answering MCQs and tend to choose the first-listed answer. This preference is consistent across languages and datasets.
2. **Comparison of MCQs and LFGQs**: Posing the same question in both formats yields a low correlation between MCQ answers and LFGQ answers, indicating that LLMs respond differently to an identical question depending on the format.
3. **Relationship Between Consistency and Accuracy**: Higher answer consistency does not necessarily imply higher accuracy.
4. **Calibration Error Analysis**: Measured by expected calibration error (ECE), LLMs are less reliable in the MCQ format than in the LFGQ format and show a stronger tendency towards overconfidence.

In summary, the paper aims to reveal the limitations of MCQs as an evaluation tool and argues that current evaluation methods need further improvement to measure the capabilities of LLMs properly.
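The calibration point above can be made concrete with a minimal sketch of expected calibration error. It assumes the common equal-width binning definition, ECE = Σ_b (n_b / N) · |acc(b) − conf(b)|; the summary does not state the paper's bin count or how confidences are extracted from model outputs, so `n_bins` and the input lists here are illustrative assumptions rather than the authors' setup.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: bin count is not given in the summary
) -> float:
    """ECE with equal-width confidence bins over [0, 1]."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to a bin by its confidence.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    # Weighted average of |accuracy - mean confidence| over non-empty bins.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

Under this definition, a model that reports 0.9 confidence on four answers but gets only three right has a gap of |0.75 − 0.9| in that bin, which is the kind of overconfidence the summary describes.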
