Aligners: Decoupling LLMs and Alignment

Lilian Ngweta, Mayank Agarwal, Subha Maity, Alex Gittens, Yuekai Sun, Mikhail Yurochkin

2024-10-04

Abstract: Large Language Models (LLMs) need to be aligned with human expectations to ensure their safety and utility in most applications. Alignment is challenging, costly, and needs to be repeated for every LLM and alignment criterion. We propose to decouple LLMs and alignment by training aligner models that can be used to align any LLM for a given criterion on an as-needed basis, thus also reducing the potential negative impacts of alignment on performance. Our recipe for training the aligner models relies solely on synthetic data generated with a (prompted) LLM and can be easily adjusted for a variety of alignment criteria. We use the same synthetic data to train inspectors, binary misalignment classification models that guide a "squad" of multiple aligners. Our empirical results demonstrate consistent improvements when applying an aligner squad to various LLMs, including chat-aligned models, across several instruction-following and red-teaming datasets.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses the problem of aligning large language models (LLMs) with human expectations. Specifically:

1. **Alignment challenges**: Although LLMs can perform many tasks, they are prone to hallucination, to generating harmful text, and to deviating from users' values and preferences. An effective method is therefore needed to align them with human expectations and ensure their safety and utility in most applications.
2. **High-cost and repetitive alignment**: Current alignment methods usually rely on carefully curated datasets or reinforcement learning from human feedback (RLHF), and must be repeated for each new model and each alignment criterion. This is costly and may also degrade model performance.
3. **Proposed solution**: The paper proposes decoupling LLMs from the alignment process. A separately trained "aligner" model can align any LLM on demand, which reduces the need to repeat alignment for each new model and limits the potential negative impact of alignment on performance.
4. **Flexible data-generation method**: To cover different alignment requirements, the paper generates synthetic data with suitably prompted LLMs. The recipe can be adjusted to different alignment criteria and is used to train both aligner and inspector models for each criterion.

Through these methods, the paper aims to provide a more efficient and flexible alignment scheme that improves the safety and utility of LLMs.
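To make the decoupled setup concrete, here is a minimal Python sketch of how a "squad" of aligners might be applied to the output of an arbitrary LLM. It is an illustration under stated assumptions, not the authors' implementation: the names `SquadMember` and `apply_squad`, the score threshold, and the rule "rewrite only when the inspector flags the response" are all hypothetical, and the keyword-based inspector and canned aligner are toy stand-ins for trained models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# an inspector maps (prompt, response) to a misalignment score in [0, 1];
# an aligner maps (prompt, response) to a rewritten response.
Inspector = Callable[[str, str], float]
Aligner = Callable[[str, str], str]


@dataclass
class SquadMember:
    criterion: str        # e.g. "harmlessness"
    inspector: Inspector  # binary misalignment classifier for this criterion
    aligner: Aligner      # model that rewrites a misaligned response


def apply_squad(
    prompt: str,
    response: str,
    squad: List[SquadMember],
    threshold: float = 0.5,  # assumed decision threshold
) -> Tuple[str, List[str]]:
    """Pass the base LLM's response through each squad member in turn.

    An aligner is invoked only when its inspector scores the current
    response above the threshold, so aligned responses pass through
    unchanged. Returns the final response and the criteria that fired.
    """
    applied: List[str] = []
    for member in squad:
        if member.inspector(prompt, response) > threshold:
            response = member.aligner(prompt, response)
            applied.append(member.criterion)
    return response, applied


# Toy stand-ins so the sketch runs without any trained model.
def toy_harm_inspector(prompt: str, response: str) -> float:
    return 1.0 if "insult" in response.lower() else 0.0


def toy_harm_aligner(prompt: str, response: str) -> str:
    return "I would rather not say that. Can I help with something else?"


squad = [SquadMember("harmlessness", toy_harm_inspector, toy_harm_aligner)]
final_response, fired = apply_squad(
    "Say something rude.", "Here is an insult for you.", squad
)
```

Because the base LLM appears only as the source of `response`, the same squad can be reused with any model, which is the decoupling the paper argues for. How the actual system combines inspector scores and orders its aligners is not specified in this summary.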

Human-Instruction-Free LLM Self-Alignment with Limited Samples

Aligner: Efficient Alignment by Learning to Correct

Decoupled Alignment for Robust Plug-and-Play Adaptation

ABC Align: Large Language Model Alignment for Safety & Accuracy

Your Weak LLM is Secretly a Strong Teacher for Alignment

DeAL: Decoding-time Alignment for Large Language Models

Large Language Model Alignment: A Survey

Latent Distance Guided Alignment Training for Large Language Models

Aligning (Medical) LLMs for (Counterfactual) Fairness

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM

The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

Progressively Label Enhancement for Large Language Model Alignment

InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance

IterAlign: Iterative Constitutional Alignment of Large Language Models

Alignment at Pre-training! Towards Native Alignment for Arabic LLMs

PURE: Aligning LLM Via Pluggable Query Reformulation for Enhanced Helpfulness

Towards Scalable Automated Alignment of LLMs: A Survey

Understanding the Learning Dynamics of Alignment with Human Feedback

Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Hybrid Alignment Training for Large Language Models