MetaAligner: Towards Generalizable Multi-Objective Alignment of Language Models

Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Tianlin Zhang, Sophia Ananiadou
2024-10-07
Abstract: Recent advancements in large language models (LLMs) focus on aligning to heterogeneous human expectations and values via multi-objective preference alignment. However, existing methods depend on the policy model parameters, which requires a high-cost repetition of their alignment algorithms for each new policy model, and they cannot expand to unseen objectives due to their static alignment objectives. In this work, we propose Meta-Objective Aligner (MetaAligner), the first policy-agnostic and generalizable method for multi-objective preference alignment. MetaAligner models multi-objective alignment in three stages: (1) a dynamic objectives reformulation algorithm reorganizes traditional alignment datasets to supervise the model on performing flexible alignment across different objectives; (2) a conditional weak-to-strong correction paradigm aligns the weak outputs of fixed policy models to approach strong outputs with higher preferences in the corresponding alignment objectives, enabling plug-and-play inference on any policy model, which significantly reduces training costs and facilitates alignment on closed-source policy models; (3) a generalizable inference method flexibly adjusts target objectives by updating their text descriptions in the prompts, facilitating generalizable alignment to unseen objectives. Experimental results show that MetaAligner achieves significant and balanced improvements in multi-objective alignment on 10 state-of-the-art policy models, and saves up to 93.63% of GPU training hours compared to previous alignment methods. The model also effectively aligns unseen objectives, marking the first step towards generalizable multi-objective preference alignment.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to solve two main challenges in multi-objective preference alignment for large language models (LLMs):

1. **High-cost repeated alignment algorithms**: Existing multi-objective alignment methods depend on the parameters of the policy model, so the high-cost alignment algorithm must be re-run every time a new policy model is introduced. This is incompatible with the rapid iteration and ever-increasing scale of current base models.
2. **Limitations of static alignment objectives**: Existing methods can only align to a fixed set of pre-defined objectives and cannot extend to unseen objectives, which limits how well the alignment method generalizes.

To solve these problems, the paper proposes **Meta-Objective Aligner (MetaAligner)**, a policy-agnostic and generalizable multi-objective preference alignment method. MetaAligner achieves multi-objective alignment through three stages:

1. **Dynamic objectives reformulation algorithm**: Reorganizes a traditional alignment dataset into a dynamic multi-objective alignment dataset, enabling MetaAligner to align flexibly under different combinations of objectives.
2. **Conditional weak-to-strong correction paradigm**: Aligns the weak output of a fixed policy model towards a strong output with higher preference on the corresponding objectives. This enables plug-and-play inference, significantly reduces training costs, and supports alignment of closed-source policy models.
3. **Generalizable inference method**: Flexibly adjusts the target objectives by updating their text descriptions in the prompt, enabling MetaAligner to adapt to unseen objectives and implement new alignment strategies through in-context learning.

Experimental results show that MetaAligner achieves significant and balanced multi-objective alignment improvements on 10 state-of-the-art policy models while saving up to 93.63% of GPU training hours. In addition, MetaAligner effectively aligns unseen objectives, marking the first step towards generalizable multi-objective preference alignment.
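The three stages above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt template, the objective descriptions, the `PreferencePair` structure, and the function names `build_prompt` and `reformulate` are all assumptions made for illustration. The sketch shows one plausible way to turn a preference pair into (prompt, target) training samples conditioned on a shuffled subset of objectives, and how an unseen objective could be introduced at inference time simply by adding its text description to the prompt.

```python
import random
from dataclasses import dataclass

# Hypothetical objective descriptions; the wording is illustrative only.
OBJECTIVES = {
    "harmlessness": "The response should avoid harmful or unsafe content.",
    "helpfulness": "The response should be useful and address the query.",
    "humor": "The response should be cheerful and amusing.",
}


@dataclass
class PreferencePair:
    """One annotated sample: two responses and, per objective, the preferred one."""
    query: str
    response_a: str
    response_b: str
    preferred: dict  # objective name -> "a" or "b"


def build_prompt(query, weak_response, objectives, descriptions):
    """Render a correction prompt conditioned on the chosen objectives.

    The template is an assumption, not the one used in the paper.
    """
    lines = [f"- {name}: {descriptions[name]}" for name in objectives]
    return (
        "Edit the response so that it better satisfies these objectives:\n"
        + "\n".join(lines)
        + f"\nQuery: {query}\nResponse: {weak_response}\nImproved response:"
    )


def reformulate(pair, descriptions, rng):
    """Stage 1 (sketch): build dynamic-objective training samples.

    For each direction, gather the objectives on which one response wins,
    shuffle them, and use the losing response as the weak input and the
    winning response as the strong target (stage 2's weak-to-strong pairing).
    """
    samples = []
    directions = (
        ("a", pair.response_b, pair.response_a),
        ("b", pair.response_a, pair.response_b),
    )
    for winner, weak, strong in directions:
        objs = [o for o, w in pair.preferred.items() if w == winner]
        if not objs:
            continue
        rng.shuffle(objs)  # vary objective order across samples
        prompt = build_prompt(pair.query, weak, objs, descriptions)
        samples.append((prompt, strong))
    return samples


# Stage 3 (sketch): an unseen objective is added only as a text description.
extended = dict(
    OBJECTIVES,
    conciseness="The response should be brief and to the point.",
)
inference_prompt = build_prompt(
    "How do I reset my password?",
    "A long draft produced by some policy model.",
    ["helpfulness", "conciseness"],
    extended,
)
```

In this sketch the policy model is never touched: the aligner only sees the query, the policy model's output, and the objective descriptions, which is what would make the correction step reusable across policy models, including closed-source ones.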