Abstract: Reinforcement learning with human feedback (RLHF) is shown to largely benefit from precise reward models (RMs). However, recent studies in reward modeling schemes are skewed towards English, limiting the applicability of RLHF in multilingual alignments. In this work, we investigate the cross-lingual transfer of RMs trained in diverse languages, primarily from English. Our experimental results demonstrate the strong cross-lingual transfer of English RMs, exceeding target language RMs by 3~4% average increase in Multilingual RewardBench. Furthermore, we analyze the cross-lingual transfer of RMs through the representation shifts. Finally, we perform multilingual alignment to exemplify how cross-lingual transfer in RM propagates to enhanced multilingual instruction-following capability, along with extensive analyses on off-the-shelf RMs. We release the code, model, and data.

What problem does this paper attempt to address?

The paper examines how well reward models (RMs) transfer across languages in multilingual alignment. Specifically, the authors ask whether RMs trained on English data can be applied effectively to non-English languages, and how strong that cross-lingual transfer is. Existing reward modeling research focuses mainly on English, which limits the use of reinforcement learning from human feedback (RLHF) in multilingual settings. The study therefore measures the performance of English RMs across several languages and analyzes the underlying mechanism, with the aim of making multilingual alignment more effective.

### Main problems:

1. **Effectiveness of cross-lingual transfer**: Investigate whether reward models trained in English can be applied effectively to other languages, especially languages that do not use the Latin script.
2. **Multilingual alignment ability**: Explore how cross-lingually transferred reward models affect multilingual instruction-following ability.
3. **Importance of representation preservation**: Analyze why English reward models better preserve the representations of the initial multilingual pretrained language models (MLMs), and how this leads to better cross-lingual transfer.

### Solutions:

- **Dataset construction**: Use synthetic preference data built from five representative English preference datasets, translated into four target languages (Spanish, Italian, Korean, Chinese).
- **Model training**: Fine-tune two state-of-the-art 3B multilingual language models (Llama-3.2-3B-Instruct and Qwen2.5-3B-Instruct) as reward models.
- **Evaluation method**: Evaluate cross-lingual transfer on Multilingual RewardBench, with a breakdown by task category (such as chat, safety, and reasoning).

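The training and evaluation steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the summary does not say which training objective was used, so the Bradley-Terry pairwise loss below is simply the objective most commonly used for reward models, and the function names, record layout, and scores are hypothetical. The accuracy function counts a preference pair as correct when the chosen response receives a higher reward than the rejected one, which is how RewardBench-style benchmarks are usually scored.

```python
import math
from collections import defaultdict


def bradley_terry_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    ASSUMPTION: a common reward-model objective, not confirmed by the summary.
    Computed as softplus(-margin) in a numerically stable form.
    """
    margin = chosen_reward - rejected_reward
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(records):
    """Fraction of pairs where the chosen response outscores the rejected one.

    `records` is an iterable of (language, category, chosen_reward,
    rejected_reward) tuples; this layout is hypothetical. Results are grouped
    by (language, category) so transfer can be compared per task category.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for language, category, chosen, rejected in records:
        key = (language, category)
        totals[key] += 1
        if chosen > rejected:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Made-up reward scores, for illustration only.
toy_records = [
    ("ko", "reasoning", 1.2, 0.3),   # correct: chosen scored higher
    ("ko", "reasoning", -0.5, 0.1),  # incorrect: rejected scored higher
    ("zh", "chat", 0.9, 0.2),        # correct
]
toy_accuracy = pairwise_accuracy(toy_records)
# toy_accuracy == {("ko", "reasoning"): 0.5, ("zh", "chat"): 1.0}
```

Comparing an English-trained RM with a target-language RM then amounts to running `pairwise_accuracy` on each model's scores for the same translated pairs and comparing the per-language results.
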
### Experimental results:

- **Performance of cross-lingual transfer**: English reward models perform well across multiple languages, with an average accuracy 3% to 4% higher than that of reward models trained in the target language.
- **Significant improvement in reasoning tasks**: The advantage of English reward models is largest on reasoning tasks, especially in languages that do not use the Latin script, with improvements of 12% for Korean and 27% for Chinese.
- **Analysis of representation preservation**: A comparison of hidden states across languages shows that English reward models better preserve the representational diversity of the initial model, whereas training in other languages may homogenize the representations.

### Conclusion:

The study shows experimentally that English reward models transfer strongly to other languages, with the largest gains on reasoning tasks. This suggests an efficient, low-cost route to multilingual alignment: train reward models on high-quality English preference data instead of collecting and annotating preference data separately for each language. The finding is relevant to extending RLHF beyond English.
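
The representation-preservation analysis summarized above compares hidden states before and after reward-model training. The summary does not specify the metric the authors used, so the sketch below substitutes a simple illustrative proxy: the mean pairwise cosine similarity of a set of hidden-state vectors, where a value closer to 1 indicates more homogenized representations. The function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors.

    ASSUMPTION: used here as a rough homogenization proxy; this is not
    necessarily the measure used in the paper.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Made-up 2-D "hidden states", for illustration only.
diverse_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # spread out
homogenized_states = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0]]  # nearly parallel

diverse_score = mean_pairwise_cosine(diverse_states)
homogenized_score = mean_pairwise_cosine(homogenized_states)
```

In practice the vectors would be hidden states extracted from the initial model and from each trained reward model on the same inputs, and the scores would be compared across training languages.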

Cross-lingual Transfer of Reward Models in Multilingual Alignment
