Abstract: Large language models (LLMs), with their powerful generative capabilities and vast knowledge, empower various tasks in everyday life. However, these abilities are concentrated primarily in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving over 100 linguistic communities worldwide. An intuitive way to enhance multilingual capabilities is to construct instruction data for each language, but doing so for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To achieve this, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages, and performed instruction tuning on this dataset to facilitate capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-3-8B, and conducted a comprehensive evaluation of BayLing. On multilingual translation across 100+ languages, BayLing outperforms open-source models of similar scale. On multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across more than 20 low-resource languages, demonstrating effective knowledge transfer from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in high-resource languages while improving performance in low-resource languages. The demo, homepage, code, and models of BayLing are available.

MAAM: A Morphology-Aware Alignment Model for Unsupervised Bilingual Lexicon Induction

Bilingual word embedding fusion for robust unsupervised bilingual lexicon induction

Morphologically Aware Word-Level Translation

Unsupervised Bilingual Lexicon Induction Via Latent Variable Models

Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment

HybridVocab: Towards Multi-Modal Machine Translation Via Multi-Aspect Alignment

Adversarial Training for Unsupervised Bilingual Lexicon Induction

Research of English-Chinese Alignment at Word Granularity on Parallel Corpora

Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment

Bilingual Lexicon Induction from Non-Parallel Data with Minimal Supervision

BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment

Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models

Learning Multilingual Representation for Natural Language Understanding with Enhanced Cross-Lingual Supervision

Earth Mover's Distance Minimization for Unsupervised Bilingual Lexicon Induction

Word Alignment Modeling with Context Dependent Deep Neural Network

Collocation Extraction Using Monolingual Word Alignment Method

A Relaxed Matching Procedure for Unsupervised BLI

AlignBench: Benchmarking Chinese Alignment of Large Language Models

Better Character Language Modeling Through Morphology

A Hybrid Model for Computational Morphology Application

Aligning Translation-Specific Understanding to General Understanding in Large Language Models