Cross-lingual Transfer of Reward Models in Multilingual Alignment

Jiwoo Hong, Noah Lee, Rodrigo Martínez-Castaño, César Rodríguez, James Thorne
2024-10-24
Abstract: Reinforcement learning with human feedback (RLHF) is shown to largely benefit from precise reward models (RMs). However, recent studies in reward modeling schemes are skewed towards English, limiting the applicability of RLHF in multilingual alignments. In this work, we investigate the cross-lingual transfer of RMs trained in diverse languages, primarily from English. Our experimental results demonstrate the strong cross-lingual transfer of English RMs, exceeding target language RMs by 3~4% average increase in Multilingual RewardBench. Furthermore, we analyze the cross-lingual transfer of RMs through the representation shifts. Finally, we perform multilingual alignment to exemplify how cross-lingual transfer in RM propagates to enhanced multilingual instruction-following capability, along with extensive analyses on off-the-shelf RMs. We release the code, model, and data.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper examines how well reward models (RMs) transfer across languages in multilingual alignment. Specifically, the authors ask whether RMs trained on English data can be used effectively for non-English languages, and how strong this cross-lingual transfer is. Current reward modeling research focuses mainly on English, which limits the use of reinforcement learning from human feedback (RLHF) in multilingual settings. The study therefore measures the performance of English RMs across several languages and analyzes the underlying mechanism, with the aim of making multilingual alignment more practical.

### Main problems:

1. **Effectiveness of cross-lingual transfer**: Investigate whether reward models trained in English can be applied effectively to other languages, especially languages that do not use the Latin script.
2. **Multilingual alignment ability**: Explore how cross-lingually transferred reward models affect multilingual instruction-following ability.
3. **Importance of representation preservation**: Analyze why English reward models better preserve the representations of the initial multilingual pretrained language models (MLMs), and how this leads to better cross-lingual transfer.

### Solutions:

- **Dataset construction**: Use synthetic preference data built from five representative English preference datasets, translated into four target languages (Spanish, Italian, Korean, Chinese).
- **Model training**: Fine-tune two state-of-the-art 3B multilingual pretrained models (Llama-3.2-3B-Instruct and Qwen2.5-3B-Instruct) as reward models.
- **Evaluation method**: Measure cross-lingual transfer on Multilingual RewardBench, with attention to performance in individual task categories (such as chat, safety, and reasoning).
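The training and evaluation steps above can be illustrated with a minimal Python sketch. It assumes the common Bradley-Terry pairwise objective for reward model fine-tuning and a RewardBench-style pairwise accuracy, where a pair counts as correct when the chosen response scores higher than the rejected one. The summary itself does not state the loss, and the function names below are hypothetical.

```python
import math
from typing import Sequence


def bradley_terry_loss(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs.

    Assumption: this is the standard Bradley-Terry objective, not a detail
    confirmed by the summary above.
    """
    if len(chosen) != len(rejected) or len(chosen) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    total = 0.0
    for r_c, r_r in zip(chosen, rejected):
        margin = r_c - r_r
        # -log(sigmoid(m)) = softplus(-m), computed in a numerically stable form.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen)


def pairwise_accuracy(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Fraction of pairs where the chosen response receives the higher reward."""
    if len(chosen) != len(rejected) or len(chosen) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    correct = sum(1 for r_c, r_r in zip(chosen, rejected) if r_c > r_r)
    return correct / len(chosen)
```

Given reward scores for the chosen and rejected responses of each pair, `pairwise_accuracy` could be computed once per language and per task category to compare an English-trained reward model against a target-language one.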
### Experimental results:

- **Performance of cross-lingual transfer**: English reward models perform strongly across languages, with average accuracy 3-4% higher than that of reward models trained in the target language.
- **Significant improvement in reasoning tasks**: The advantage of English reward models is largest on reasoning tasks, especially for languages that do not use the Latin script: Korean and Chinese improve by 12% and 27%, respectively.
- **Analysis of representation preservation**: A comparison of hidden states across languages shows that English reward models better preserve the representational diversity of the initial model, while training in other languages may homogenize the representations.

### Conclusion:

This study shows experimentally that English reward models transfer strongly across languages, with especially large gains on reasoning tasks. This suggests an efficient, low-cost route to multilingual alignment: train reward models on high-quality English preference data instead of collecting and annotating preference data separately for each language. The finding is relevant to extending RLHF to multilingual settings.
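The notion of representation homogenization mentioned in the experimental results can be made concrete with a small sketch. The proxy below, mean pairwise cosine similarity over a set of hidden-state vectors, is an assumption chosen for illustration and is not necessarily the measure used in the paper; `mean_pairwise_cosine` is a hypothetical helper. A value that rises after fine-tuning would indicate that hidden states for different inputs have become more alike.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(states: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over all unordered pairs of hidden states.

    Illustrative proxy only: higher values mean the vectors point in more
    similar directions, i.e. the representations are more homogeneous.
    """
    pairs = list(combinations(states, 2))
    if not pairs:
        raise ValueError("need at least two hidden-state vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Under this assumption, one could compute the value on hidden states from the initial model and from each fine-tuned reward model for the same inputs, then compare the change across training languages.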