Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages

Fabian David Schmidt, Philipp Borchert, Ivan Vulić, Goran Glavaš

2024-06-19

Abstract: LLMs have become a go-to solution not just for text generation, but also for natural language understanding (NLU) tasks. Acquiring extensive knowledge through language modeling on web-scale corpora, they excel on English NLU, yet struggle to extend their NLU capabilities to underrepresented languages. In contrast, machine translation (MT) models produce excellent multilingual representations, resulting in strong translation performance even for low-resource languages. MT encoders, however, lack the knowledge necessary for comprehensive NLU that LLMs obtain through language modeling training on immense corpora. In this work, we get the best of both worlds by integrating MT encoders directly into LLM backbones via sample-efficient self-distillation. The resulting MT-LLMs preserve the inherent multilingual representational alignment from the MT encoder, allowing lower-resource languages to tap into the rich knowledge embedded in English-centric LLMs. Merging the MT encoder and LLM in a single model, we mitigate the propagation of translation errors and inference overhead of MT decoding inherent to discrete translation-based cross-lingual transfer (e.g., translate-test). Evaluation spanning three prominent NLU tasks and 127 predominantly low-resource languages renders MT-LLMs highly effective in cross-lingual transfer. MT-LLMs substantially and consistently outperform translate-test based on the same MT model, showing that we truly unlock multilingual language understanding for LLMs.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the limitations of large language models (LLMs) in cross-lingual natural language understanding (NLU). Although LLMs perform very well on English NLU tasks, their performance drops significantly on low-resource languages. The paper proposes integrating a machine translation (MT) encoder into an LLM to create a single multilingual model, the MT-LLM, with stronger cross-lingual NLU capabilities. This approach combines the general knowledge that LLMs hold in English and other high-resource languages with the robust multilingual representations of the MT encoder, so that the LLM can better understand and process input in many languages. The main contributions are:
1. Integrating the MT encoder into the LLM architecture, which enables zero-shot cross-lingual transfer (ZS-XLT) and outperforms the translate-test (TTEST) baseline across various tasks.
2. Showing that the method applies to different types of LLM architectures, and improving the model's task-specific understanding through self-distillation.
3. Running extensive and fair experimental comparisons on multiple NLU tasks and a large number of low-resource languages, which show the advantage of MT-LLMs in cross-lingual transfer.
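The summary above mentions self-distillation without showing what such an objective looks like. The following is only a toy sketch of a distillation-style loss, not the paper's training procedure: a trainable linear projection maps stand-in "MT encoder" sentence vectors toward frozen "teacher" vectors using a mean-squared-error loss. The vector sizes, the random data, the hidden linear map that generates the teacher targets, and all function names are assumptions made for illustration. In the actual method the projected representations would pass through the LLM itself, which this sketch omits.

```python
import random

random.seed(0)

MT_DIM, LLM_DIM = 4, 3  # hypothetical sizes, chosen only for the toy example


def project(W, x):
    """Apply a linear projection W (one row per output dimension) to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def distill_step(W, student_in, teacher_out, lr):
    """One SGD step on MSE(project(W, student_in), teacher_out); updates W in place."""
    pred = project(W, student_in)
    n = len(pred)
    for i, row in enumerate(W):
        grad_out = 2.0 * (pred[i] - teacher_out[i]) / n
        for j in range(len(row)):
            row[j] -= lr * grad_out * student_in[j]
    return mse(pred, teacher_out)  # loss measured before the update


# Stand-in data (assumption): teacher targets come from a hidden linear map,
# so a perfect projection exists and the loss can approach zero.
hidden = [[random.uniform(-1, 1) for _ in range(MT_DIM)] for _ in range(LLM_DIM)]
student_inputs = [[random.uniform(-1, 1) for _ in range(MT_DIM)] for _ in range(20)]
teacher_outputs = [project(hidden, x) for x in student_inputs]

W = [[0.0] * MT_DIM for _ in range(LLM_DIM)]  # trainable projection, starts at zero
losses = []
for epoch in range(200):
    step_losses = [
        distill_step(W, x, t, lr=0.1)
        for x, t in zip(student_inputs, teacher_outputs)
    ]
    losses.append(sum(step_losses) / len(step_losses))

print(f"first epoch loss: {losses[0]:.4f}, last epoch loss: {losses[-1]:.6f}")
```

Only the projection is trained here while the teacher targets stay fixed, which mirrors the general idea of a frozen teacher guiding a student. Everything else about the real setup (models, pooling, data, loss details) should be taken from the paper itself.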

Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners

A Novel Paradigm Boosting Translation Capabilities of Large Language Models

Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages

Simul-LLM: A Framework for Exploring High-Quality Simultaneous Translation with Large Language Models

Investigating Decoder-only Large Language Models for Speech-to-text Translation

TransLLaMa: LLM-based Simultaneous Translation System

Extrapolating Large Language Models to Non-English by Aligning Languages

Stacking Small Language Models for Generalizability

Bootstrapping Multilingual Semantic Parsers using Large Language Models

Empowering Cross-lingual Abilities of Instruction-tuned Large Language Models by Translation-following demonstrations

Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages

Language Models and Cycle Consistency for Self-Reflective Machine Translation

Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs

X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale

Boosting LLM Translation Skills without General Ability Loss via Rationale Distillation

How do Large Language Models Handle Multilingualism?

Massively Multilingual Shallow Fusion with Large Language Models

RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs

Speech Translation with Large Language Models: An Industrial Practice

Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis