LangSAMP: Language-Script Aware Multilingual Pretraining

Yihong Liu, Haotian Ye, Chunlan Ma, Mingyang Wang, Hinrich Schütze
2024-09-27
Abstract: Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings -- learnable vectors assigned to different languages. These embeddings are discarded for two main reasons: (1) mPLMs are expected to have a single, unified parameter set across all languages, and (2) they need to function seamlessly as universal text encoders without requiring language IDs as input. However, this removal increases the burden on token embeddings to encode all language-specific information, which may hinder the model's ability to produce more language-neutral representations. To address this challenge, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning while maintaining a simple architecture. Specifically, we integrate these embeddings into the output of the transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline. Extensive analysis further shows that language/script embeddings encode language/script-specific information, which improves the selection of source languages for crosslingual transfer. We make our code and models publicly available at https://github.com/cisnlp/LangSAMP.
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to improve the cross-language transfer performance of the representations learned by multilingual pretrained language models (mPLMs). Specifically, it addresses the following two challenges:

1. **Language Neutrality**:
   - Recent mPLMs (such as XLM-R and mBERT) usually avoid using language embeddings, that is, learnable vectors assigned to different languages. This ensures that the model has a single set of parameters across all languages and can work seamlessly as a general-purpose text encoder without requiring a language ID as input.
   - However, removing language embeddings increases the burden on token embeddings, which must then encode all language-specific information. This may weaken the model's ability to produce more language-neutral representations.
2. **Cross-language Transfer**:
   - Effective cross-language transfer requires the model to share a unified subspace across languages. However, research shows that existing mPLMs encode a large amount of language- and script-specific information, which is unfavorable for cross-language transfer.
   - The paper therefore proposes a new method, **Language-Script Aware Multilingual Pretraining (LangSAMP)**, which enhances representation learning by introducing language and script embeddings while maintaining a simple architecture.

### Solutions of LangSAMP

The core idea of LangSAMP is to add language and script embeddings to the output of the Transformer blocks, rather than at the input stage.
The specific steps are as follows:

- **Language and Script Embeddings**: Introduce language embeddings \( E_{\text{Lang}} \in \mathbb{R}^{L \times D} \) and script embeddings \( E_{\text{Script}} \in \mathbb{R}^{S \times D} \) into the model, where \( L \) is the number of languages, \( S \) is the number of scripts, and \( D \) is the embedding dimension.
- **Language-Script Aware Modeling**: In the masked language modeling (MLM) task, add the language and script embeddings to the output \( h_i \) of the Transformer blocks to form the final representation \( o_i = h_i + E_{\text{Lang}}^{l} + E_{\text{Script}}^{s} \), where \( l \) and \( s \) index the language and script of the input text, and then pass \( o_i \) to the language modeling head for decoding.

The benefits of this method are:

- **Reduced Burden**: Because the language and script embeddings take over part of the task of encoding language-specific information, the output of the Transformer blocks can become more language-neutral.
- **Cross-language Transfer**: The improved representations contribute to better cross-language transfer, especially for low-resource languages and non-Latin-script languages.

### Experimental Results

Experiments show that the model trained with LangSAMP performs well on multiple downstream tasks, with especially notable improvements on sequence-level tasks such as sentence retrieval and text classification. In addition, the language and script embeddings are especially helpful for low-resource languages and non-Latin-script languages, which supports the effectiveness of the method.

### Summary

The paper addresses the deficiencies of existing mPLMs in language neutrality and cross-language transfer by introducing language and script embeddings. It proposes a simple and effective method, LangSAMP, that improves the performance of multilingual pretrained models.
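
As a rough illustration of the output-side combination described under "Solutions of LangSAMP", the sketch below computes \( o_i = h_i + E_{\text{Lang}}^{l} + E_{\text{Script}}^{s} \) for a short sequence using only the Python standard library. The helper names (`make_table`, `langsamp_output`), the tiny sizes, and the random initialization are assumptions made for illustration and are not taken from the authors' code. The Transformer blocks and the language modeling head are omitted, with random vectors standing in for the block outputs \( h_i \).

```python
import random


def make_table(rows, dim, seed):
    """Hypothetical helper: a rows x dim table of small random values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def langsamp_output(hidden, lang_table, script_table, lang_id, script_id):
    """Return o_i = h_i + E_Lang[l] + E_Script[s] for every position i.

    The same language and script vectors are added at every token
    position, since one input text has one language and one script.
    """
    lang_vec = lang_table[lang_id]
    script_vec = script_table[script_id]
    return [
        [h + e_l + e_s for h, e_l, e_s in zip(h_i, lang_vec, script_vec)]
        for h_i in hidden
    ]


# Illustrative sizes only (assumed): L = 3 languages, S = 2 scripts, D = 4.
D = 4
E_lang = make_table(rows=3, dim=D, seed=0)     # stands in for E_Lang, shape L x D
E_script = make_table(rows=2, dim=D, seed=1)   # stands in for E_Script, shape S x D
hidden = make_table(rows=5, dim=D, seed=2)     # stand-in for block outputs h_1..h_5

# o would be passed to the language modeling head during MLM pretraining.
out = langsamp_output(hidden, E_lang, E_script, lang_id=1, script_id=0)
```

In this sketch the language and script IDs enter only when `out` is formed for the MLM head; `hidden` itself is computed without them, which is consistent with the goal of keeping the encoder usable without language IDs as input.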