Abstract: Model merging, such as model souping, is the practice of combining different models with the same architecture together without further training. In this work, we present a model merging methodology that addresses the difficulty of fine-tuning Large Language Models (LLMs) for target tasks in non-English languages, where task-specific data is often unavailable. We focus on mathematical reasoning and, without in-language math data, facilitate cross-lingual transfer by composing language and math capabilities. Starting from the same pretrained model, we fine-tune separate "experts" on math instruction data in English and on generic instruction data in the target language. We then replace the top and bottom transformer layers of the math expert directly with layers from the language expert, which consequently enhances math performance in the target language. The resulting merged models outperform the individual experts and other merging methods on the math benchmark, MGSM, by 10% across four major languages where math instruction data is scarce. In addition, this layer swapping is simple, inexpensive, and intuitive, as it is based on an interpretative analysis of the most important parameter changes during the fine-tuning of each expert. The ability to successfully re-compose LLMs for cross-lingual transfer in this manner opens up future possibilities to combine model expertise, create modular solutions, and transfer reasoning capabilities across languages all post hoc.

What problem does this paper attempt to address?

This paper addresses how to improve the performance of large language models (LLMs) on specific tasks, in particular mathematical reasoning, through zero-shot cross-lingual transfer to languages where task-specific data is scarce. Because most LLMs are pre-trained mainly on English data, they usually perform poorly on tasks in non-English languages, especially when no task-specific data exists in the target language. To address this, the authors propose a model merging method called layer swapping. It combines a model fine-tuned on English mathematical data with a model fine-tuned on general instruction data in the target language, without additional training, and thereby improves mathematical reasoning in the target language. Specifically, the method works as follows:

1. **Fine-tuning of expert models**: Starting from the same pre-trained model, a "math expert" is obtained by fine-tuning on English mathematical data, and a "language expert" is obtained by fine-tuning on general instruction data in the target language.
2. **Layer swapping**: The top and bottom few transformer layers of the math expert are replaced with the corresponding layers of the language expert, while the middle layers of the math expert are kept. To smooth the transition, several "transition zones" are added, whose parameters are a weighted average of the corresponding layers of the math expert and the language expert.
3. **Evaluation and comparison**: Experiments on four languages where math instruction data is scarce (Swahili, Telugu, Bengali, and Japanese) evaluate layer swapping against the individual expert models and against conventional model merging methods such as model souping.

With this approach, the authors aim to provide a simple, low-cost way to improve LLM performance in a target language without task-specific data in that language.
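The layer-swapping step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the parameter naming scheme (`layers.<index>.…`), the function names, the fixed averaging weight `alpha` for transition zones, and the choice to keep non-layer parameters (such as embeddings) from the math expert are all assumptions made for illustration. Parameters are plain lists of floats so the example runs without any deep learning library.

```python
import re

# Hypothetical parameter naming: "layers.<index>.<rest>" (an assumption).
LAYER_KEY = re.compile(r"^layers\.(\d+)\.")


def language_weight(layer, num_layers, bottom, top, transition, alpha):
    """Return the fraction of the language expert used for one layer index."""
    if layer < bottom or layer >= num_layers - top:
        return 1.0  # bottom/top layers: taken from the language expert
    if layer < bottom + transition or layer >= num_layers - top - transition:
        return alpha  # transition zone: weighted average of both experts
    return 0.0  # middle layers: kept from the math expert


def layer_swap(math_sd, lang_sd, num_layers, bottom, top, transition=0, alpha=0.5):
    """Merge two same-architecture parameter dicts by swapping outer layers."""
    if bottom + top + 2 * transition > num_layers:
        raise ValueError("swapped and transition layers exceed the model depth")
    merged = {}
    for name, math_param in math_sd.items():
        match = LAYER_KEY.match(name)
        if match is None:
            # Non-layer parameters stay with the math expert (an assumption).
            merged[name] = list(math_param)
            continue
        w = language_weight(int(match.group(1)), num_layers,
                            bottom, top, transition, alpha)
        merged[name] = [w * l + (1.0 - w) * m
                        for l, m in zip(lang_sd[name], math_param)]
    return merged


# Toy experts: math parameters are all 0.0 and language parameters all 1.0,
# so each merged value equals the language-expert fraction for that layer.
NUM_LAYERS = 8
math_expert = {f"layers.{i}.weight": [0.0, 0.0] for i in range(NUM_LAYERS)}
lang_expert = {f"layers.{i}.weight": [1.0, 1.0] for i in range(NUM_LAYERS)}

merged = layer_swap(math_expert, lang_expert, NUM_LAYERS,
                    bottom=2, top=2, transition=1)
profile = [merged[f"layers.{i}.weight"][0] for i in range(NUM_LAYERS)]
print(profile)
```

In this toy setup the printed profile shows, per layer, how much of the language expert ends up in the merged model: the outer layers come entirely from the language expert, the middle layers entirely from the math expert, and the transition layers are an even blend. With real models the same logic would be applied to each tensor in the two experts' parameter dictionaries, and the number of swapped and transition layers would need to be chosen for the specific architecture.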

Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models
