Extensive Self-Contrast Enables Feedback-Free Language Model Alignment

Xiao Liu, Xixuan Song, Yuxiao Dong, Jie Tang

2024-03-31

Abstract: Reinforcement learning from human feedback (RLHF) has been a central technique for recent large language model (LLM) alignment. However, its heavy dependence on costly human or LLM-as-Judge preference feedback could stymie its wider application. In this work, we introduce Self-Contrast, a feedback-free LLM alignment method that exploits extensive self-generated negatives. With only supervised fine-tuning (SFT) targets, Self-Contrast leverages the LLM itself to generate a large, diverse set of candidates, and harnesses a pre-trained embedding model to filter multiple negatives according to text similarity. Theoretically, we illustrate that in this setting, merely scaling negative responses can still effectively approximate situations with more balanced positive and negative preference annotations. Our experiments with direct preference optimization (DPO) on three datasets show that Self-Contrast consistently outperforms SFT and standard DPO training by large margins. As the number of self-generated negatives increases, the performance of Self-Contrast continues to grow. Code and data are available at
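To make the training objective mentioned in the abstract concrete, here is a minimal sketch of the standard DPO pairwise loss applied to one positive (the SFT target) and several self-generated negatives. Averaging the pairwise losses over the negatives is an assumption made for illustration, not something the abstract specifies; the function names, the `beta` value, and the toy log-probabilities are all hypothetical.

```python
import math


def dpo_pair_loss(pol_pos, ref_pos, pol_neg, ref_neg, beta=0.1):
    """Standard DPO loss for one (positive, negative) pair.

    Inputs are sequence log-probabilities under the policy (pol_*) and the
    frozen reference model (ref_*). beta=0.1 is an illustrative value.
    """
    margin = beta * ((pol_pos - ref_pos) - (pol_neg - ref_neg))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), in a stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def multi_negative_loss(pol_pos, ref_pos, negatives, beta=0.1):
    """Average the pairwise loss over several negatives for one positive.

    `negatives` is a list of (policy_logprob, reference_logprob) tuples.
    Averaging (rather than summing) is an assumption of this sketch.
    """
    losses = [
        dpo_pair_loss(pol_pos, ref_pos, pol_neg, ref_neg, beta)
        for pol_neg, ref_neg in negatives
    ]
    return sum(losses) / len(losses)


# Toy usage: one SFT target paired with three self-generated negatives.
loss = multi_negative_loss(-12.0, -12.5, [(-15.0, -14.0), (-13.0, -13.0), (-20.0, -18.5)])
```

When the policy and the reference assign identical log-probabilities, every margin is zero and the loss equals log 2; raising the positive's log-probability relative to the negatives lowers it.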

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper introduces Self-Contrast, a feedback-free alignment method for large language models. It targets a known bottleneck of Reinforcement Learning from Human Feedback (RLHF): preference feedback from humans or from an LLM-as-Judge is slow and expensive to collect. Self-Contrast avoids this by relying on a large number of self-generated negative samples, improving alignment without iterative training or comparative feedback. Starting from SFT targets only, the method has the LLM itself generate many diverse candidate responses, then uses a pre-trained embedding model to select negatives according to their text similarity to the target, keeping candidates that are dissimilar to it. The theoretical analysis argues that, even when negatives far outnumber positives, scaling the number of negatives can effectively approximate training with more balanced positive and negative preference annotations. In experiments with Direct Preference Optimization (DPO) on three datasets, Self-Contrast outperforms both Supervised Fine-Tuning (SFT) and standard DPO training, and performance continues to improve as the number of self-generated negatives increases. In short, the paper addresses how to align LLMs without expensive preference feedback, and proposes a scalable approach based on contrasting an SFT target against many self-generated negatives.
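The negative-selection step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the embeddings are toy vectors standing in for the output of a pre-trained embedding model, and `select_negatives`, `drop_top_frac`, and `k` are hypothetical names. The rule used here (discard the candidates most similar to the SFT target, keep the rest as negatives) is one plausible reading of "filter according to text similarity".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def select_negatives(target_emb, cand_embs, drop_top_frac=0.25, k=None):
    """Return indices of candidates kept as negatives, least similar first.

    The `drop_top_frac` most similar candidates are discarded, on the
    assumption that they may be as good as the target. If `k` is given,
    only the k least similar survivors are returned.
    """
    sims = [cosine(target_emb, c) for c in cand_embs]
    order = sorted(range(len(sims)), key=lambda i: sims[i])  # ascending
    n_drop = math.ceil(drop_top_frac * len(order))
    kept = order[: len(order) - n_drop]
    return kept if k is None else kept[:k]


# Toy usage: four candidate embeddings compared against one target embedding.
target = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
negative_ids = select_negatives(target, candidates, drop_top_frac=0.25)
```

In this toy case the first candidate is identical to the target and is dropped, while the remaining three are kept as negatives, ordered from least to most similar.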
