Abstract: Large language models (LLMs), with their powerful generative capabilities and vast knowledge, empower various tasks in everyday life. However, these abilities are primarily concentrated in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving over 100 linguistic communities worldwide. An intuitive way to enhance these capabilities would be to construct instruction data for each language, but doing so for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To achieve this, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages, and performed instruction tuning on this dataset to facilitate capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-3-8B, and conducted a comprehensive evaluation of BayLing. For multilingual translation across 100+ languages, BayLing shows superior performance compared to open-source models of similar scale. On multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across over 20 low-resource languages, demonstrating effective knowledge transfer from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in high-resource languages while enhancing performance in low-resource languages. Demo, homepage, code and models of BayLing are available.

M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models

Improving Multilingual Instruction Finetuning via Linguistically Natural and Diverse Datasets

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following

Towards Democratizing Multilingual Large Language Models For Medicine Through A Two-Stage Instruction Fine-tuning Approach

Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment

Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models

BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models

Evolutionary Contrastive Distillation for Language Model Alignment

TaCIE: Enhancing Instruction Comprehension in Large Language Models through Task-Centred Instruction Evolution

M$^3$IT: A Large-Scale Dataset Towards Multi-Modal Multilingual Instruction Tuning

Multilingual Instruction Tuning With Just a Pinch of Multilinguality

Instruction Mining: Instruction Data Selection for Tuning Large Language Models

CoEvol: Constructing Better Responses for Instruction Finetuning through Multi-Agent Cooperation

InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning

Aligning Large Multi-Modal Model with Robust Instruction Tuning

Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models