Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education

Duc-Vu Nguyen, Quoc-Nam Nguyen
2023-11-16
Abstract: In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, which has fewer challenging MCQA datasets than English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has used the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT. However, these studies have mainly examined how ChatGPT solves the VNHSGE step by step. We aim to create a novel and high-quality dataset by providing structured guidelines for typing LaTeX formulas for mathematics, physics, chemistry, and biology. This dataset can be used to evaluate the MCSB ability of LLMs and smaller language models (LMs) because it is typed in a strict LaTeX style. We focus on predicting the character (A, B, C, or D) that is the most likely answer to a question, given the context of the question. Our evaluation of six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and our proposed dataset shows promising results on the MCSB ability of LLMs for Vietnamese. The dataset is available for research purposes only.
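The abstract describes the prediction target as the single character (A, B, C, or D) that is most likely given the question. A minimal sketch of that selection step follows. It assumes the model has already returned a log probability for each candidate next token; the function name `pick_answer_symbol` and the dictionary input format are hypothetical and are not taken from the paper.

```python
import math

SYMBOLS = ("A", "B", "C", "D")


def pick_answer_symbol(symbol_logprobs):
    """Pick the most likely answer symbol from next-token log probabilities.

    `symbol_logprobs` maps token strings to log probabilities (an assumed
    input format). Tokens other than A-D are ignored, and a symbol that is
    missing from the mapping is treated as having probability zero.
    """
    scores = {s: symbol_logprobs.get(s, float("-inf")) for s in SYMBOLS}
    best = max(SYMBOLS, key=lambda s: scores[s])
    if scores[best] == float("-inf"):
        raise ValueError("no log probability given for any of A, B, C, D")

    # Renormalize over the four symbols only (softmax, shifted for stability).
    shifted = {s: math.exp(scores[s] - scores[best]) for s in SYMBOLS}
    total = sum(shifted.values())
    probs = {s: shifted[s] / total for s in SYMBOLS}
    return best, probs


best, probs = pick_answer_symbol({"A": -2.0, "B": -0.5, "C": -3.0, "D": -4.0})
print(best, round(probs[best], 3))
```

Restricting the comparison to the four symbols means a model that spreads probability over unrelated tokens is still scored on which option letter it prefers; whether the paper scores models this way is not stated in the abstract.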
Computation and Language
What problem does this paper attempt to address?
This paper primarily explores the multiple-choice symbol binding (MCSB) capability of large language models (LLMs) on Vietnamese multiple-choice question answering (MCQA) tasks. Specifically:

1. **Evaluating Symbol Binding Capability**:
   - The paper evaluates the ability of LLMs to perform symbol binding on multiple-choice questions under zero-shot, one-shot, and few-shot settings.
   - The focus is on Vietnamese because it has fewer challenging MCQA datasets than English.
2. **Creating a New Dataset**:
   - To better assess the MCSB capability of LLMs, the authors created a new high-quality dataset and provided structured guidelines for typing LaTeX formulas in mathematics, physics, chemistry, and biology.
   - The dataset strictly follows a LaTeX style and can be used to evaluate both LLMs and smaller language models.
3. **Model Performance Evaluation**:
   - The paper evaluates six well-known LLMs: BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.
   - The results show that the GPT series models perform best across the settings, with GPT-4 achieving the highest accuracy in zero-shot, one-shot, and five-shot settings.

Through these studies, the paper aims to advance Vietnamese natural language processing and provide insights for future educational applications.
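To make the zero-shot, one-shot, and few-shot settings concrete, the sketch below builds a k-shot prompt from solved demonstration questions and scores predictions by accuracy. The prompt template, the field names (`question`, `options`, `answer`), and the helper names are all assumptions made for illustration; the paper's actual prompt format is not given in this summary.

```python
def format_question(question, options, answer=None):
    """Render one question with options labelled A-D (assumed template)."""
    lines = [f"Question: {question}"]
    for symbol, text in zip("ABCD", options):
        lines.append(f"{symbol}. {text}")
    # Demonstrations include the gold symbol; the target leaves it open.
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)


def build_prompt(demos, target, k):
    """Build a k-shot prompt: k solved demonstrations, then the target.

    k=0 gives the zero-shot setting, k=1 one-shot, and larger k few-shot.
    """
    blocks = [
        format_question(d["question"], d["options"], d["answer"])
        for d in demos[:k]
    ]
    blocks.append(format_question(target["question"], target["options"]))
    return "\n\n".join(blocks)


def accuracy(predictions, gold):
    """Fraction of predicted symbols that match the gold symbols."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical items; LaTeX is kept verbatim, as in a strictly typed dataset.
demos = [{"question": "$1 + 1 = ?$", "options": ["1", "2", "3", "4"], "answer": "B"}]
target = {"question": "$2 \\times 3 = ?$", "options": ["5", "6", "7", "8"]}

print(build_prompt(demos, target, k=1))
print(accuracy(["A", "B"], ["A", "C"]))
```

Ending the prompt at `Answer:` leaves the next token as the only thing the model has to supply, which is what lets the evaluation compare the four option letters directly.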