Abstract: While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these models: Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). SFT is simple and robust, powering a host of open-source models, while RLHF is a more sophisticated method used in top-tier models like ChatGPT but suffers from instability and susceptibility to reward hacking. We propose a novel approach, Supervised Iterative Learning from Human Feedback (SuperHF), which seeks to leverage the strengths of both methods. Our hypothesis is two-fold: that the reward model used in RLHF is critical for efficient data use and model generalization, and that the use of Proximal Policy Optimization (PPO) in RLHF may not be necessary and could contribute to instability issues. SuperHF replaces PPO with a simple supervised loss and a Kullback-Leibler (KL) divergence prior. It creates its own training data by repeatedly sampling a batch of model outputs and filtering them through the reward model in an online learning regime. We then break down the reward optimization problem into three components: robustly optimizing the training rewards themselves; preventing reward hacking (exploitation of the reward model that degrades model performance), as measured by a novel METEOR similarity metric; and maintaining good performance on downstream evaluations. Our experimental results show that SuperHF exceeds PPO-based RLHF on the training objective, easily and favorably trades off high reward with low reward hacking, improves downstream calibration, and performs the same on our GPT-4 based qualitative evaluation scheme, all while being significantly simpler to implement. These results highlight SuperHF's potential as a competitive language model alignment technique.
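
The loop described in the abstract (sample a batch, filter it through the reward model, then take a supervised step with a KL prior) can be illustrated with a toy sketch. This is not the authors' implementation: the "policy" here is only a categorical distribution over five candidate outputs, the reward function is a stand-in, and the names `superhf_step` and `train`, the hyperparameters, and the KL direction (current policy against the initial policy) are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def superhf_step(logits, ref_probs, reward_fn, rng,
                 batch_size=16, keep=4, lr=0.5, kl_coef=0.1):
    """One toy iteration: sample, filter by reward, supervised step + KL prior.

    All hyperparameter values are hypothetical, chosen only for the demo.
    """
    probs = softmax(logits)
    n = len(logits)

    # 1. Sample a batch of outputs from the current policy.
    samples = rng.choices(range(n), weights=probs, k=batch_size)

    # 2. Filter: keep the samples the reward model scores highest.
    best = sorted(samples, key=reward_fn, reverse=True)[:keep]

    # 3. Supervised loss: cross-entropy on the kept samples.
    #    Its gradient w.r.t. logit j is probs[j] - freq[j].
    freq = [0.0] * n
    for s in best:
        freq[s] += 1.0 / len(best)

    # 4. KL prior toward the reference policy (direction is an assumption).
    #    d KL(p || q) / d logit_j = p_j * (log(p_j / q_j) - KL(p || q)).
    kl = kl_divergence(probs, ref_probs)
    grad = [
        (probs[j] - freq[j])
        + kl_coef * probs[j] * (math.log(probs[j] / ref_probs[j]) - kl)
        for j in range(n)
    ]
    return [l - lr * g for l, g in zip(logits, grad)]


def train(steps=50, seed=0):
    """Run the toy loop and return the policy's expected reward."""
    rng = random.Random(seed)
    rewards = [0.0, 1.0, 2.0, 3.0, 4.0]   # stand-in for a reward model
    logits = [0.0] * len(rewards)          # uniform initial policy
    ref_probs = softmax(logits)            # frozen copy used as the prior
    for _ in range(steps):
        logits = superhf_step(logits, ref_probs, lambda i: rewards[i], rng)
    probs = softmax(logits)
    return sum(p * r for p, r in zip(probs, rewards))
```

Calling `train(steps=0)` gives the expected reward of the uniform starting policy, and `train(steps=50)` gives the value after fifty filtered supervised updates. In this sketch, raising `kl_coef` keeps the policy closer to its starting distribution, which mirrors the reward versus reward-hacking trade-off the abstract describes.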

Intuitive Fine-Tuning: Towards Unifying SFT and RLHF into a Single Process

Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process

UFT: Unifying Fine-Tuning of SFT and RLHF/DPO/UNA through a Generalized Implicit Reward Function

Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment

Learning or Self-aligning? Rethinking Instruction Fine-tuning

ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood

LaFFi: Leveraging Hybrid Natural Language Feedback for Fine-tuning Language Models

Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model

A Framework for Fine-Tuning LLMs using Heterogeneous Feedback

An Emulator for Fine-Tuning Large Language Models using Small Language Models

Continual SFT Matches Multimodal RLHF with Negative Supervision

Supervised Fine-Tuning as Inverse Reinforcement Learning

Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning

SuperHF: Supervised Iterative Learning from Human Feedback

Preference Ranking Optimization for Human Alignment

HFT: Half Fine-Tuning for Large Language Models

Fine-tuning Language Models with Generative Adversarial Feedback

Reinforcement Learning from Reflective Feedback (RLRF): Aligning and Improving LLMs via Fine-Grained Self-Reflection

Self-Evolution Fine-Tuning for Policy Optimization

RRHF: Rank Responses to Align Language Models with Human Feedback