What problem does this paper attempt to address?

This paper addresses how to evaluate the performance of large language models (LLMs) on multilingual document question answering tasks. Specifically, it focuses on the following points:

1. **Multilingual support**: Most current large language models are pre-trained mainly on datasets in English and other high-resource languages, so they usually perform well in those languages but poorly in low-resource ones. The paper explores how to improve model performance across multiple languages, especially languages that are widely used globally but have fewer data resources.
2. **Impact of translation strategies**: The paper studies how different translation strategies affect model performance. These include translating the context, questions, and answers from the original language into a high-resource language (such as English), as well as partial translation (for example, translating only the question or only the answer).
3. **Model selection and comparison**: The paper compares different versions of GPT models (such as GPT-4-32K and GPT-3.5-Turbo) on multilingual tasks to assess the advantages of the latest models in multilingual support.
4. **Dataset selection**: To evaluate the models' multilingual capabilities comprehensively, the paper uses multiple datasets, including the Stanford Question Answering Dataset (SQuAD), the Cross-lingual Question Answering Dataset (XQuAD), the Environmental, Social, and Governance Sustainability Dataset (ESG), and the Hebrew Question Answering Dataset (HeQ). These datasets cover content from different fields and involve multiple languages.

Through this research, the paper aims to provide a systematic evaluation method that helps researchers and developers better understand how large language models perform in a multilingual environment and to suggest improvements.
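The translation strategies in point 2 can be illustrated with a minimal Python sketch. This is not the paper's implementation: the strategy names, the `QAExample` type, the `apply_strategy` helper, and the `fake_translate` stand-in are all hypothetical names introduced here for illustration. In practice `fake_translate` would be replaced by a real translation call.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, FrozenSet

# Hypothetical translator signature: (text, target_language) -> translated text.
Translator = Callable[[str, str], str]


@dataclass(frozen=True)
class QAExample:
    context: str
    question: str
    answer: str


# Illustrative strategy labels (assumptions, not the paper's terminology).
# Each maps to the set of fields that get translated into the pivot language.
STRATEGIES: Dict[str, FrozenSet[str]] = {
    "no_translation": frozenset(),
    "question_only": frozenset({"question"}),
    "answer_only": frozenset({"answer"}),
    "full_translation": frozenset({"context", "question", "answer"}),
}


def apply_strategy(
    example: QAExample,
    strategy: str,
    translate: Translator,
    pivot: str = "en",
) -> QAExample:
    """Return a copy of `example` with the strategy's fields translated."""
    fields = STRATEGIES[strategy]
    updates = {name: translate(getattr(example, name), pivot) for name in fields}
    return replace(example, **updates)


def fake_translate(text: str, target: str) -> str:
    """Placeholder translator that only tags the text; swap in a real one."""
    return f"[{target}] {text}"


if __name__ == "__main__":
    example = QAExample(context="ctx", question="q", answer="a")
    for name in STRATEGIES:
        print(name, apply_strategy(example, name, fake_translate))
```

Keeping the strategy as a set of field names makes it straightforward to add further partial variants (for example, context only) and to run the same model and dataset under each variant for comparison.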

Evaluation Methodology for Large Language Models for Multilingual Document Question and Answer

How do Large Language Models Handle Multilingualism?

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

A Survey on Evaluation of Large Language Models


Leveraging Large Language Models for Multiple Choice Question Answering

Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation

Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ

What do Large Language Models Need for Machine Translation Evaluation?

Using Large Language Models to Evaluate Biomedical Query-Focused Summarisation

Evaluation of medium-large Language Models at zero-shot closed book generative question answering

Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models

Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models

Multilingual Large Language Models: A Systematic Survey

Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers

Large Language Model-Based Evaluation of Medical Question Answering Systems: Algorithm Development and Case Study

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Evaluating Large Language Models: A Comprehensive Survey

A Study on Large Language Models' Limitations in Multiple-Choice Question Answering

Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis