Abstract: This paper provides a comprehensive survey of the latest research on multilingual large language models (MLLMs). MLLMs can understand and generate language across linguistic boundaries, and they represent an important advancement in artificial intelligence. We first discuss the architecture and pre-training objectives of MLLMs, highlighting the key components and methodologies that contribute to their multilingual capabilities. We then discuss the construction of multilingual pre-training and alignment datasets, underscoring the importance of data quality and diversity in enhancing MLLM performance. An important focus of this survey is the evaluation of MLLMs. We present a detailed taxonomy and roadmap covering the assessment of MLLMs' cross-lingual knowledge, reasoning, alignment with human values, safety, interpretability and specialized applications. Specifically, we extensively discuss multilingual evaluation benchmarks and datasets, and explore the use of LLMs themselves as multilingual evaluators. To help turn MLLMs from black boxes into white boxes, we also address the interpretability of multilingual capabilities, cross-lingual transfer and language bias within these models. Finally, we provide a comprehensive review of real-world applications of MLLMs across diverse domains, including biology, medicine, computer science, mathematics and law. We showcase how these models have driven innovation and improvements in these specialized fields while also highlighting the challenges and opportunities in deploying MLLMs within diverse language communities and application scenarios. The papers covered in this survey are listed and publicly available at <a class="link-external link-https" href="https://github.com/tjunlp-lab/Awesome-Multilingual-LLMs-Papers" rel="external noopener nofollow">this https URL</a>.
