Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models

Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu
2024-10-02
Abstract: Model merging, such as model souping, is the practice of combining different models with the same architecture together without further training. In this work, we present a model merging methodology that addresses the difficulty of fine-tuning Large Language Models (LLMs) for target tasks in non-English languages, where task-specific data is often unavailable. We focus on mathematical reasoning and without in-language math data, facilitate cross-lingual transfer by composing language and math capabilities. Starting from the same pretrained model, we fine-tune separate "experts" on math instruction data in English and on generic instruction data in the target language. We then replace the top and bottom transformer layers of the math expert directly with layers from the language expert, which consequently enhances math performance in the target language. The resulting merged models outperform the individual experts and other merging methods on the math benchmark, MGSM, by 10% across four major languages where math instruction data is scarce. In addition, this layer swapping is simple, inexpensive, and intuitive, as it is based on an interpretative analysis of the most important parameter changes during the fine-tuning of each expert. The ability to successfully re-compose LLMs for cross-lingual transfer in this manner opens up future possibilities to combine model expertise, create modular solutions, and transfer reasoning capabilities across languages all post hoc.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to improve the performance of large language models (LLMs) on a target task, here mathematical reasoning, in languages where task-specific data is scarce, using zero-shot cross-lingual transfer. Because most LLMs are pre-trained mainly on English data, they usually perform poorly on tasks in non-English languages, especially when no task-specific data exists in the target language. To address this, the authors propose a model merging method called layer swapping. It combines a model fine-tuned on English math data with a model fine-tuned on generic instruction data in the target language, without additional training, to improve mathematical reasoning in the target language. The method has three parts:

1. **Fine-tuning of expert models**: Starting from the same pre-trained model, a "math expert" is obtained by fine-tuning on English math instruction data, and a "language expert" is obtained by fine-tuning on generic instruction data in the target language.
2. **Layer swapping**: The top and bottom Transformer layers of the math expert are replaced with the corresponding layers of the language expert, while the middle layers of the math expert are kept. To smooth the boundary between swapped and kept layers, "transition zones" are added whose parameters are a weighted average of the corresponding layers of the math expert and the language expert.
3. **Evaluation and comparison**: Experiments on four languages where math instruction data is scarce (Swahili, Telugu, Bengali, and Japanese) evaluate layer swapping on the MGSM math benchmark and compare it with the individual experts and with other model merging methods such as model soup.
In this way, the authors aim to provide a simple, low-cost way to improve LLM performance in low-resource languages without task-specific data in the target language.
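The layer-swapping step can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each expert's parameters are available as a dictionary mapping names such as `layers.3.weight` to flat lists of floats (a hypothetical naming scheme), that the transition zones use a linear ramp between the two experts (the text above only says "weighted average"), and that parameters outside the layer stack are kept from the math expert. The function names and the `n_swap` and `n_transition` parameters are illustrative.

```python
import re


def layer_index(name):
    """Return the layer index encoded in a parameter name, or None if absent."""
    match = re.search(r"layers\.(\d+)\.", name)
    return int(match.group(1)) if match else None


def language_weight(idx, num_layers, n_swap, n_transition):
    """Fraction of the language expert used at layer idx (1.0 = fully swapped)."""
    # Distance from this layer to the nearest end of the layer stack.
    dist = min(idx, num_layers - 1 - idx)
    if dist < n_swap:
        return 1.0  # outermost layers come entirely from the language expert
    if dist < n_swap + n_transition:
        # Assumed linear ramp across the transition zone.
        step = dist - n_swap + 1
        return 1.0 - step / (n_transition + 1)
    return 0.0  # middle layers stay with the math expert


def layer_swap(math_sd, lang_sd, num_layers, n_swap, n_transition=0):
    """Merge two experts that share an architecture, without any training."""
    merged = {}
    for name, math_w in math_sd.items():
        idx = layer_index(name)
        # Assumption: non-layer parameters (e.g. embeddings) keep the math expert's values.
        alpha = 0.0 if idx is None else language_weight(
            idx, num_layers, n_swap, n_transition
        )
        lang_w = lang_sd[name]
        merged[name] = [alpha * l + (1.0 - alpha) * m for l, m in zip(lang_w, math_w)]
    return merged
```

For an 8-layer model with `n_swap=2` and `n_transition=1`, this sketch takes layers 0, 1, 6, and 7 from the language expert, averages layers 2 and 5 equally, and keeps layers 3 and 4 from the math expert. With real checkpoints the same per-parameter interpolation would be applied to tensors instead of lists.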