Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

Amir Saeidi, Shivanshu Verma, Chitta Baral
2024-04-23
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a spectrum of tasks. Recently, Direct Preference Optimization (DPO) has emerged as an RL-free approach to optimize the policy model on human preferences. However, several limitations hinder the widespread adoption of this method. To address these shortcomings, various versions of DPO have been introduced. Yet, a comprehensive evaluation of these variants across diverse tasks is still lacking. In this study, we aim to bridge this gap by investigating the performance of alignment methods across three distinct scenarios: (1) keeping the Supervised Fine-Tuning (SFT) part, (2) skipping the SFT part, and (3) skipping the SFT part and utilizing an instruction-tuned model. Furthermore, we explore the impact of different training sizes on their performance. Our evaluation spans a range of tasks including dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding, encompassing 13 benchmarks such as MT-Bench, Big Bench, and Open LLM Leaderboard. Key observations reveal that alignment methods achieve optimal performance with smaller training data subsets, that they exhibit limited effectiveness in reasoning tasks yet significantly impact mathematical problem-solving, and that employing an instruction-tuned model notably influences truthfulness. We anticipate that our findings will catalyze further research aimed at developing more robust models to address alignment challenges.
Computation and Language
What problem does this paper attempt to address?
The paper evaluates how several RL-free alignment methods (such as DPO, KTO, IPO, and CPO) perform across a variety of tasks. Specifically:

1. **Performance evaluation in different scenarios**: The paper examines these alignment methods in three scenarios:
   - Retaining the Supervised Fine-Tuning (SFT) part;
   - Skipping the SFT part and directly fine-tuning the pre-trained model;
   - Skipping the SFT part and using an instruction-tuned model.
2. **Task diversity**: The study covers 13 benchmarks, including MT-Bench, Big Bench, and Open LLM Leaderboard, across domains such as dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding.
3. **Impact of data volume**: It explores how the size of the training dataset affects model performance, finding that smaller data subsets often yield better results.
4. **Comparative analysis**: By comparing SFT with the alignment methods above, the paper points out that although certain methods (like KTO) perform well on specific tasks, they do not necessarily outperform a simple SFT strategy in multi-task understanding and reasoning.

Overall, the paper aims to advance future research by comprehensively evaluating these alignment techniques, with the goal of encouraging the development of more robust and efficient models to tackle alignment challenges.