Abstract: Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings -- learnable vectors assigned to different languages. These embeddings are discarded for two main reasons: (1) mPLMs are expected to have a single, unified parameter set across all languages, and (2) they need to function seamlessly as universal text encoders without requiring language IDs as input. However, this removal increases the burden on token embeddings to encode all language-specific information, which may hinder the model's ability to produce more language-neutral representations. To address this challenge, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning while maintaining a simple architecture. Specifically, we integrate these embeddings into the output of the transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline. Extensive analysis further shows that language/script embeddings encode language/script-specific information, which improves the selection of source languages for crosslingual transfer. We make our code and models publicly available at https://github.com/cisnlp/LangSAMP.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to improve the cross-language transfer performance of the representations produced by multilingual pretrained language models (mPLMs). Specifically, it addresses the following two main challenges:

1. **Language Neutrality**:
   - Recent mPLMs (such as XLM-R and mBERT) usually avoid language embeddings, that is, learnable vectors assigned to different languages. This keeps a single set of parameters across all languages and lets the model work as a general-purpose text encoder without requiring a language ID as input.
   - However, removing language embeddings increases the burden on token embeddings, which must then encode all language-specific information. This may weaken the model's ability to produce more language-neutral representations.
2. **Cross-language Transfer**:
   - Effective cross-language transfer requires the model to share a unified subspace among different languages. However, research shows that existing mPLMs encode a large amount of language-specific and script-specific information, which is unfavorable for cross-language transfer.
   - Therefore, the paper proposes a new method, **Language-Script Aware Multilingual Pretraining (LangSAMP)**, which enhances representation learning by introducing language and script embeddings while maintaining a simple architecture.

### Solutions of LangSAMP

The core idea of LangSAMP is to add language and script embeddings to the output of the Transformer blocks, rather than at the input stage. The specific steps are as follows:

- **Language and Script Embeddings**: Introduce language embeddings \( E_{\text{Lang}} \in \mathbb{R}^{L \times D} \) and script embeddings \( E_{\text{Script}} \in \mathbb{R}^{S \times D} \) into the model, where \( L \) is the number of languages, \( S \) is the number of scripts, and \( D \) is the embedding dimension.
- **Language-Script-Aware Modeling**: In the masked language modeling (MLM) task, add the language and script embeddings to the output \( h_i \) of the Transformer blocks to form the final representation \( o_i = h_i + E_{\text{Lang}}^l + E_{\text{Script}}^s \), where \( l \) and \( s \) index the language and script of the input, and then pass \( o_i \) to the language modeling head for decoding.

The benefits of this method are:

- **Reduced Burden**: Because the language and script embeddings take over part of the task of encoding language-specific information, the output of the Transformer blocks becomes more language-neutral.
- **Cross-language Transfer**: The improved representations contribute to better cross-language transfer, especially for low-resource languages and non-Latin-script languages.

### Experimental Results

Experiments show that the model trained with LangSAMP performs well on multiple downstream tasks, with significant improvements on sequence-level tasks such as sentence retrieval and text classification. In addition, the language and script embeddings are especially helpful for low-resource languages and non-Latin-script languages, which supports the effectiveness of the method.

### Summary

The paper addresses the deficiencies of existing mPLMs in language neutrality and cross-language transfer by introducing language and script embeddings. The proposed method, LangSAMP, is simple and effective, and improves the performance of multilingual pretrained models.
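The output-side combination \( o_i = h_i + E_{\text{Lang}}^l + E_{\text{Script}}^s \) described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`make_embedding_table`, `langsamp_output`), the toy sizes, and the random initialization are assumptions made for illustration. In the actual method the embeddings are learned during pretraining and \( o_i \) is passed to the language modeling head, which is omitted here.

```python
import random


def make_embedding_table(num_rows, dim, seed):
    """Hypothetical helper: a (num_rows x dim) table of small random values.

    Stands in for a learnable embedding matrix such as E_Lang or E_Script.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]


def langsamp_output(hidden_states, lang_table, script_table, lang_id, script_id):
    """Compute o_i = h_i + E_Lang[l] + E_Script[s] for every token position i.

    The same language and script vectors are added to each token's hidden state.
    """
    lang_vec = lang_table[lang_id]
    script_vec = script_table[script_id]
    return [
        [h + l + s for h, l, s in zip(h_i, lang_vec, script_vec)]
        for h_i in hidden_states
    ]


# Toy sizes (assumed): L languages, S scripts, embedding dimension D.
L, S, D = 3, 2, 4
E_lang = make_embedding_table(L, D, seed=0)
E_script = make_embedding_table(S, D, seed=1)

# Stand-in for the Transformer block outputs h_i of a 5-token sentence.
hidden = [[0.1 * (i + j) for j in range(D)] for i in range(5)]

# Sentence written in language index 2 and script index 1 (arbitrary toy IDs).
outputs = langsamp_output(hidden, E_lang, E_script, lang_id=2, script_id=1)
print(len(outputs), len(outputs[0]))
```

Because the addition happens only at the output, the hidden states `hidden` themselves are computed without any language or script ID; in this sketch the IDs only shift the vectors that would be handed to the prediction head.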

LangSAMP: Language-Script Aware Multilingual Pretraining
