Evaluation Methodology for Large Language Models for Multilingual Document Question and Answer

Adar Kahana, Jaya Susan Mathew, Said Bleik, Jeremy Reynolds, Oren Elisha
2024-02-02
Abstract: With the widespread adoption of Large Language Models (LLMs), this paper investigates the multilingual capability of these models. Our preliminary results show that translating the native-language context, question, and answer into a high-resource language produced the best results.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to evaluate the performance of large language models (LLMs) on multilingual document question-answering tasks. Specifically, it focuses on the following points:

1. **Multilingual support**: Most large language models are pre-trained mainly on datasets in English and other high-resource languages, so they usually perform well in those languages but poorly in low-resource ones. The paper explores how to improve model performance across multiple languages, especially languages that are widely used globally but have fewer data resources.
2. **Impact of translation strategies**: The paper studies how different translation strategies affect model performance. These include translating the context, question, and answer from the original language into a high-resource language (such as English), as well as partial translation (for example, translating only the question or only the answer).
3. **Model selection and comparison**: The paper compares different versions of GPT models (such as GPT-4-32K and GPT-3.5-Turbo) on multilingual tasks to evaluate the advantages of the latest models in multilingual support.
4. **Dataset selection**: To evaluate the models' multilingual capabilities comprehensively, the paper uses multiple datasets, including the Stanford Question Answering Dataset (SQuAD), the Cross-lingual Question Answering Dataset (XQuAD), the Environmental, Social, and Governance (ESG) sustainability dataset, and the Hebrew Question Answering Dataset (HeQ). These datasets cover different fields and involve multiple languages.

Through this research, the paper aims to provide a systematic evaluation method that helps researchers and developers better understand how large language models perform in a multilingual environment, and to suggest improvements.
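The comparison of translation strategies described above can be illustrated with a small evaluation harness. This is a minimal sketch, not the paper's implementation. The strategy names, the `translate` and `answer` callables, and the choice of SQuAD-style token-overlap F1 as the metric are all assumptions made for illustration; the summary does not state which metric or tooling the authors used.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# One QA example: (context, question, reference_answer), all in the native language.
Example = Tuple[str, str, str]

# Hypothetical strategy names. Each maps to flags saying whether the
# (context, question, reference answer) are translated to the high-resource language.
STRATEGIES: Dict[str, Tuple[bool, bool, bool]] = {
    "native_only": (False, False, False),
    "translate_question": (False, True, False),
    "translate_answer": (False, False, True),
    "translate_all": (True, True, True),
}


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 on lowercased whitespace tokens (an assumed metric)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    examples: List[Example],
    translate: Callable[[str], str],    # placeholder for a translation service
    answer: Callable[[str, str], str],  # placeholder for an LLM call: (context, question) -> answer
) -> Dict[str, float]:
    """Return the mean token F1 per translation strategy."""
    if not examples:
        raise ValueError("examples must not be empty")
    scores: Dict[str, float] = {}
    for name, (t_ctx, t_q, t_ref) in STRATEGIES.items():
        total = 0.0
        for context, question, reference in examples:
            ctx = translate(context) if t_ctx else context
            q = translate(question) if t_q else question
            ref = translate(reference) if t_ref else reference
            total += token_f1(answer(ctx, q), ref)
        scores[name] = total / len(examples)
    return scores
```

In practice, `translate` would wrap a machine translation service and `answer` would wrap a call to the model under test (for example GPT-4-32K or GPT-3.5-Turbo); both are left as injected callables here so the harness can be run with stubs and so that no specific API is implied.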