Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education

Duc-Vu Nguyen, Quoc-Nam Nguyen

2023-11-16

Abstract: In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, which has fewer challenging MCQA datasets than English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has used the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT. However, these studies have mainly examined how ChatGPT solves the VNHSGE step by step. We aim to create a novel and high-quality dataset by providing structured guidelines for typing LaTeX formulas for mathematics, physics, chemistry, and biology. This dataset can be used to evaluate the MCSB ability of LLMs and smaller language models (LMs) because it is typed in a strict LaTeX style. We focus on predicting the character (A, B, C, or D) that is the most likely answer to a question, given the context of the question. Our evaluation of six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and our proposed dataset shows promising results on the MCSB ability of LLMs for Vietnamese. The dataset is available for research purposes only.
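The abstract describes MCSB as predicting the single character (A, B, C, or D) that is the most likely answer. A minimal sketch of that selection step is shown here, assuming the model exposes a log-probability for each candidate answer character as the next token. The function name `pick_symbol` and the dictionary input format are hypothetical and are not taken from the paper.

```python
import math

def pick_symbol(logprobs, symbols=("A", "B", "C", "D")):
    """Pick the answer symbol with the highest next-token log-probability.

    `logprobs` maps candidate symbols to log-probabilities (an assumed
    input format). Symbols missing from the mapping are treated as
    having probability zero.
    """
    scores = {s: logprobs.get(s, float("-inf")) for s in symbols}
    finite = [v for v in scores.values() if v != float("-inf")]
    if not finite:
        raise ValueError("no log-probability available for any answer symbol")

    # Renormalize over the answer symbols only, using a max shift for stability.
    shift = max(finite)
    weights = {s: math.exp(v - shift) for s, v in scores.items()}
    total = sum(weights.values())
    probs = {s: w / total for s, w in weights.items()}

    best = max(symbols, key=lambda s: scores[s])
    return best, probs


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only.
    best, probs = pick_symbol({"A": -2.0, "B": -0.5, "C": -3.0, "D": -1.5})
    print(best, round(probs[best], 3))
```

Restricting the comparison to the four answer characters means any probability mass the model places on other tokens is ignored; whether the paper scores answers this way is not stated in the abstract.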

Computation and Language

What problem does this paper attempt to address?

This paper examines the multiple choice symbol binding (MCSB) ability of large language models (LLMs) on Vietnamese multiple-choice question answering (MCQA) tasks. Specifically:

1. **Evaluating Symbol Binding Ability**:
   - The paper evaluates how well LLMs perform symbol binding on multiple-choice questions under zero-shot, one-shot, and few-shot settings.
   - The focus is on Vietnamese because sufficiently challenging Vietnamese MCQA datasets are currently lacking.
2. **Creating a New Dataset**:
   - To better assess the MCSB ability of LLMs, the authors created a new high-quality dataset and provided structured guidelines for typing LaTeX formulas in mathematics, physics, chemistry, and biology.
   - The dataset is typed in a strict LaTeX style and can be used to evaluate both large and smaller language models.
3. **Model Performance Evaluation**:
   - The paper evaluates six well-known LLMs: BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.
   - The results show that the GPT series models perform best across the settings, with GPT-4 achieving the highest accuracy in zero-shot, one-shot, and five-shot settings.

Through this work, the paper aims to advance Vietnamese natural language processing and to provide insights for future educational applications.
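The zero-shot, one-shot, and few-shot settings mentioned above differ only in how many solved demonstrations precede the test question. A minimal sketch of such a prompt builder follows; the template (`Question:`, lettered options, `Answer:`), the function names, and the item dictionary layout are assumptions for illustration, not the prompt format used in the paper.

```python
def format_question(question, options, answer=None):
    """Render one multiple-choice item; leave the answer slot open if `answer` is None."""
    lines = [f"Question: {question}"]
    for symbol, text in zip("ABCD", options):
        lines.append(f"{symbol}. {text}")
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)


def build_prompt(test_item, demos=(), k=0):
    """Build a k-shot prompt: k solved demonstrations, then the unsolved test item.

    k=0 gives a zero-shot prompt, k=1 one-shot, and larger k few-shot.
    Each item is a dict with "question", "options", and (for demos) "answer".
    """
    blocks = [
        format_question(d["question"], d["options"], d["answer"])
        for d in list(demos)[:k]
    ]
    blocks.append(format_question(test_item["question"], test_item["options"]))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Toy items with LaTeX-typed math, made up for illustration.
    demo = {"question": "$1 + 1 = ?$", "options": ["$1$", "$2$", "$3$", "$4$"], "answer": "B"}
    test = {"question": "$2 \\times 3 = ?$", "options": ["$5$", "$6$", "$7$", "$8$"]}
    print(build_prompt(test, demos=[demo], k=1))
```

The prompt ends with an open `Answer:` slot so that the model's next token can be read as the predicted symbol.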

Leveraging Large Language Models for Multiple Choice Question Answering

VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models

ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models

LaVy: Vietnamese Multimodal Large Language Model

Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data

Spoken Language Intelligence of Large Language Models for Language Learning

Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

VLSP 2021 - ViMRC Challenge: Vietnamese Machine Reading Comprehension

A Study on Large Language Models' Limitations in Multiple-Choice Question Answering

Which Large Language Model should You Use in Vietnamese Education: ChatGPT, Bing Chat, or Bard?

A Vietnamese Dataset for Evaluating Machine Reading Comprehension

Can multiple-choice questions really be useful in detecting the abilities of LLMs?

Efficient Finetuning Large Language Models For Vietnamese Chatbot

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs

MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs

Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese

Revealing Weaknesses of Vietnamese Language Models Through Unanswerable Questions in Machine Reading Comprehension

LLMs May Perform MCQA by Selecting the Least Incorrect Option

Revisiting Multi-Modal LLM Evaluation

Evaluation of Large Language Models in Thailand’s National Medical Licensing Examination