Abstract: Large Language Models (LLMs) have demonstrated exceptional natural language understanding abilities and have excelled in a variety of natural language processing (NLP) tasks in recent years. Despite the fact that most LLMs are trained predominantly in English, multiple studies have demonstrated their comparable performance in many other languages. However, fundamental questions persist regarding how LLMs acquire their multi-lingual abilities and how performance varies across different languages. These inquiries are crucial for the study of LLMs since users and researchers often come from diverse language backgrounds, potentially influencing their utilization and interpretation of LLMs' results. In this work, we propose a systematic way of qualifying the performance disparities of LLMs under multilingual settings. We investigate the phenomenon of across-language generalizations in LLMs, wherein insufficient multi-lingual training data leads to advanced multi-lingual capabilities. To accomplish this, we employ a novel back-translation-based prompting method. The results show that GPT exhibits highly translating-like behaviour in multilingual settings.

What problem does this paper attempt to address?

This paper examines the multilingual capabilities of large language models (LLMs) and how their performance varies across languages. Although most LLMs are trained mainly on English data, they have been shown to perform well in many other languages. Fundamental questions remain, however, about how LLMs acquire their multilingual capabilities and how those capabilities affect performance in different languages. These questions matter to researchers and users alike, because their different language backgrounds may influence how they use LLMs and interpret the outputs. The paper therefore proposes a systematic approach to evaluating the multilingual capabilities of LLMs, both qualitatively and quantitatively, and uses a new back-translation-based prompting method to investigate cross-language generalization, that is, how limited multilingual training data can lead to advanced multilingual capabilities. Specifically, the paper focuses on the following points:

1. **Classification of Multilingual Capabilities**: The paper divides language-dependent tasks into three categories (Reasoning, Knowledge Access, and Articulation) to analyze how the choice of language affects task performance.
2. **Translation Equivariance and Translation Variance**: The paper introduces the concepts of Translation Equivariant (TE) and Translation Variant (TV) tasks to evaluate how consistent task performance is across languages.
3. **Experimental Methods**: The paper uses Prompt Translation (PT) and Response Back-Translation (RBT) to measure the performance of LLMs in different languages and the consistency of their answers.

Through these methods, the paper aims to reveal the behavioral patterns of LLMs when handling multilingual tasks, especially whether they exhibit compound, coordinate, or subordinate multilingual capabilities.
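
The Prompt Translation and Response Back-Translation steps can be pictured with a minimal Python sketch. It is one possible reading of the summary, not the paper's implementation: the `translate` and `model` callables, the `same` comparison, and all function names are hypothetical placeholders that a reader would have to supply.

```python
from typing import Callable, Dict, Iterable

# Hypothetical interfaces (assumptions, not from the paper):
#   Translator(text, source_lang, target_lang) -> translated text
#   Model(prompt) -> response text
Translator = Callable[[str, str, str], str]
Model = Callable[[str], str]


def prompt_translation(prompt_en: str, lang: str,
                       translate: Translator, model: Model) -> str:
    """PT: translate an English prompt into `lang` and query the model with it."""
    return model(translate(prompt_en, "en", lang))


def response_back_translation(prompt_en: str, lang: str,
                              translate: Translator, model: Model) -> str:
    """RBT: take the PT response and translate it back into English."""
    response = prompt_translation(prompt_en, lang, translate, model)
    return translate(response, lang, "en")


def consistency(prompt_en: str, langs: Iterable[str],
                translate: Translator, model: Model,
                same: Callable[[str, str], bool] =
                lambda a, b: a.strip().lower() == b.strip().lower(),
                ) -> Dict[str, bool]:
    """Compare each back-translated answer with the direct English answer.

    `same` is a placeholder comparison; a real study would need a
    task-specific notion of agreement.
    """
    reference = model(prompt_en)
    return {
        lang: same(reference,
                   response_back_translation(prompt_en, lang, translate, model))
        for lang in langs
    }
```

In this sketch, a language whose entry is `False` marks a prompt where the back-translated answer diverges from the English one. How agreement should be judged is left open here, since the summary does not specify it.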

Don't Trust ChatGPT when Your Question is not in English: A Study of Multilingual Abilities and Types of LLMs

Why Not Transform Chat Large Language Models to Non-English?

ChatGPT MT: Competitive for High- (but not Low-) Resource Languages

LLaMA Beyond English: An Empirical Study on Language Capability Transfer

Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis

Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models

Unveiling the Competitive Dynamics: A Comparative Evaluation of American and Chinese LLMs

Comprehensive Evaluation of ChatGPT Reliability Through Multilingual Inquiries

ChatGPT Alternative Solutions: Large Language Models Survey

Document-Level Machine Translation with Large Language Models

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Grammaticality Representation in ChatGPT as Compared to Linguists and Laypeople

Beware of Words: Evaluating the Lexical Diversity of Conversational LLMs using ChatGPT as Case Study

The GPT Surprise: Offering Large Language Model Chat in a Massive Coding Class Reduced Engagement but Increased Adopters Exam Performances

Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting

Is ChatGPT Reliable in Scoring Learner's Translation Quality?

An exploratory survey about using ChatGPT in education, healthcare, and research

Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness

Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models

The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions