Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

Amir Saeidi, Shivanshu Verma, Chitta Baral

2024-04-23

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a spectrum of tasks. Recently, Direct Preference Optimization (DPO) has emerged as an RL-free approach to optimize the policy model on human preferences. However, several limitations hinder the widespread adoption of this method. To address these shortcomings, various versions of DPO have been introduced. Yet, a comprehensive evaluation of these variants across diverse tasks is still lacking. In this study, we aim to bridge this gap by investigating the performance of alignment methods across three distinct scenarios: (1) keeping the Supervised Fine-Tuning (SFT) part, (2) skipping the SFT part, and (3) skipping the SFT part and utilizing an instruction-tuned model. Furthermore, we explore the impact of different training sizes on their performance. Our evaluation spans a range of tasks including dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding, encompassing 13 benchmarks such as MT-Bench, Big Bench, and Open LLM Leaderboard. Key observations reveal that alignment methods achieve optimal performance with smaller training data subsets, exhibit limited effectiveness in reasoning tasks yet significantly impact mathematical problem-solving, and employing an instruction-tuned model notably influences truthfulness. We anticipate that our findings will catalyze further research aimed at developing more robust models to address alignment challenges.

Computation and Language

What problem does this paper attempt to address?

The paper evaluates the performance of several RL-free alignment methods (such as DPO, KTO, IPO, and CPO) across a variety of tasks. Specifically:

1. **Performance evaluation in different scenarios**: The paper examines these alignment methods in three scenarios:
   - Retaining the Supervised Fine-Tuning (SFT) part;
   - Skipping the SFT part and directly fine-tuning the pre-trained model;
   - Skipping the SFT part and using an instruction-tuned model.
2. **Task diversity**: The study covers 13 benchmarks, including MT-Bench, Big Bench, and Open LLM Leaderboard, across dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding.
3. **Impact of data volume**: It examines how training set size affects model performance and finds that smaller data subsets often yield better results.
4. **Comparative analysis**: By comparing SFT with the alignment methods above, the paper points out that although certain methods (like KTO) perform well on specific tasks, they do not necessarily outperform a simple SFT strategy in multi-task understanding and reasoning.

Overall, by comprehensively evaluating these alignment techniques, the paper aims to support future research on more robust and efficient models for tackling alignment challenges.
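The summary names DPO without stating the objective it optimizes. As a point of reference, here is a minimal sketch of the per-example DPO loss as commonly stated in the original DPO formulation, not taken from this paper's code. The function name, argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    All names and the beta default are hypothetical, chosen for illustration.
    """
    # How much more (in log space) the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: zero margin.
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy favors the chosen response more than the reference does: lower loss.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; it falls as the policy shifts probability toward the chosen response relative to the reference. Variants such as IPO, KTO, and CPO change this objective in different ways that are not sketched here.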
