Aligners: Decoupling LLMs and Alignment

Lilian Ngweta, Mayank Agarwal, Subha Maity, Alex Gittens, Yuekai Sun, Mikhail Yurochkin
2024-10-04
Abstract: Large Language Models (LLMs) need to be aligned with human expectations to ensure their safety and utility in most applications. Alignment is challenging, costly, and needs to be repeated for every LLM and alignment criterion. We propose to decouple LLMs and alignment by training aligner models that can be used to align any LLM for a given criterion on an as-needed basis, thus also reducing the potential negative impacts of alignment on performance. Our recipe for training the aligner models relies solely on synthetic data generated with a (prompted) LLM and can be easily adjusted for a variety of alignment criteria. We use the same synthetic data to train inspectors, binary misalignment classification models that guide a "squad" of multiple aligners. Our empirical results demonstrate consistent improvements when applying an aligner squad to various LLMs, including chat-aligned models, across several instruction-following and red-teaming datasets.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of aligning large language models (LLMs) with human expectations. Specifically:

1. **Alignment challenges**: Although LLMs can perform many tasks, they are prone to hallucination, to generating harmful text, and to deviating from users' values and preferences. An effective method is therefore needed to align these models with human expectations and ensure their safety and utility in most applications.
2. **Costly, repeated alignment**: Current alignment methods usually rely on carefully curated datasets or reinforcement learning from human feedback (RLHF), and must be repeated for each new model and each alignment criterion. This is costly and may also degrade model performance.
3. **Proposed solution**: The paper proposes decoupling LLMs from the alignment process. It trains "aligner" models that can align any LLM on demand, reducing the need to repeat alignment for each new model and limiting the potential negative impact of alignment on performance.
4. **Flexible data generation**: To cover different alignment requirements, the paper generates synthetic data with a suitably prompted LLM. The recipe can be adjusted to different alignment criteria, and the same data is used to train both the aligners and the inspectors (binary misalignment classifiers).

Through these methods, the paper aims to provide a more efficient and flexible alignment scheme that improves the safety and utility of LLMs.
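
The decoupled setup summarized above (a base LLM produces a response, inspectors flag misalignment, aligners rewrite) can be illustrated with a minimal Python sketch. Everything here is hypothetical: the names `AlignerUnit` and `apply_squad`, the 0.5 threshold, and the rule of applying each flagged aligner in sequence are assumptions made for illustration, since the summary does not specify how inspectors guide the squad. The toy inspector and aligner are simple string functions standing in for trained models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AlignerUnit:
    """One squad member: an inspector paired with an aligner (hypothetical structure)."""
    criterion: str
    # Inspector: (prompt, response) -> estimated probability that the response is misaligned.
    inspector: Callable[[str, str], float]
    # Aligner: (prompt, response) -> rewritten response.
    aligner: Callable[[str, str], str]


def apply_squad(prompt: str, response: str, squad: List[AlignerUnit],
                threshold: float = 0.5) -> Tuple[str, List[str]]:
    """Rewrite `response` only with the aligners whose inspector flags it.

    Assumption: flagged aligners are applied one after another, and each
    inspector sees the response as rewritten so far.
    """
    applied = []
    for unit in squad:
        if unit.inspector(prompt, response) >= threshold:
            response = unit.aligner(prompt, response)
            applied.append(unit.criterion)
    return response, applied


# Toy stand-ins for trained models, for illustration only.
def toy_inspector(prompt: str, response: str) -> float:
    return 1.0 if "stupid" in response else 0.0


def toy_aligner(prompt: str, response: str) -> str:
    return response.replace("stupid", "mistaken")


squad = [AlignerUnit("civility", toy_inspector, toy_aligner)]
```

Because the base LLM never appears in `apply_squad`, the same squad could in principle be reused on the output of any model, which is the point of the decoupling; a response that no inspector flags is returned unchanged.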