Abstract: Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are two fundamental processes for enhancing the capabilities of Language Models (LMs) after pre-training and aligning them better with human preferences. SFT is more efficient to train, while PO delivers better alignment, so the two are often combined. However, common practice simply applies them sequentially without integrating their optimization objectives, missing the opportunity to bridge their paradigm gap and draw on the strengths of both. To obtain a unified understanding, we interpret SFT and PO through two sub-processes, Preference Estimation and Transition Optimization, defined at the token level within the Markov Decision Process (MDP) framework. This modeling shows that SFT is only a special case of PO with inferior estimation and optimization. PO evaluates the quality of the model's entire generated answer, whereas SFT only scores predicted tokens conditioned on the preceding tokens of the target answer. Therefore, SFT overestimates the model's ability, leading to inferior optimization. Building on this view, we introduce Intuitive Fine-Tuning (IFT) to integrate SFT and PO into a single process. IFT captures the LM's intuitive sense of the entire answer through a temporal residual connection, yet it relies only on a single policy and the same volume of non-preference-labeled data as SFT. Our experiments show that IFT performs comparably to, or even better than, sequential recipes of SFT and some typical PO methods across several tasks, particularly those that require generation, reasoning, and fact-following abilities. An explainable Frozen Lake game further validates the effectiveness of IFT in obtaining a competitive policy.
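The contrast the abstract draws between SFT and PO can be written out with the standard textbook forms of the two objectives. The notation below is an assumption made for illustration and is not taken from the paper: $x$ is the instruction, $y^{*}$ the target answer from the dataset $\mathcal{D}$, $y$ an answer sampled from the policy $\pi_\theta$, and $r$ a generic preference or reward signal.

```latex
% Token-level MDP view (assumed notation): the state is the instruction plus
% the tokens produced so far, and the action is the next token.
\begin{align}
  s_t &= (x,\, y_{<t}), \qquad a_t = y_t, \qquad s_{t+1} = (s_t,\, a_t) \\
  % SFT: every state is built from the *target* prefix y^*_{<t}
  \mathcal{L}_{\mathrm{SFT}}(\theta)
    &= -\,\mathbb{E}_{(x,\,y^{*}) \sim \mathcal{D}}
       \sum_{t=1}^{|y^{*}|} \log \pi_\theta\!\left(y^{*}_{t} \mid x,\, y^{*}_{<t}\right) \\
  % Generic PO-style objective: the whole answer is sampled from the policy itself
  J_{\mathrm{PO}}(\theta)
    &= \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
       \left[\, r(x,\, y) \,\right]
\end{align}
```

In the SFT loss the conditioning prefix always comes from $y^{*}$, so the model is never scored on states it would reach on its own. The PO-style objective instead scores complete answers drawn from $\pi_\theta$.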

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses two main problems in aligning language models (LMs) with human preferences:

1. **Combining Supervised Fine-Tuning (SFT) and Preference Optimization (PO)**: SFT is more efficient to train and PO aligns better, but existing practice usually applies them sequentially without integrating their optimization objectives. This ignores the paradigm gap between them and fails to make full use of their respective strengths.
2. **The core difference between SFT and PO**: SFT scores each predicted token conditioned only on the preceding tokens of the target answer, while PO evaluates the quality of the entire answer generated by the model. As a result, SFT overestimates the model's capabilities, which leads to inferior optimization.

To solve these problems, the paper proposes a new method, Intuitive Fine-Tuning (IFT), which integrates SFT and PO into a single process. Through a temporal residual connection, IFT captures the model's intuitive sense of the entire answer while using a single policy and the same volume of non-preference-labeled data as SFT.

### Specific solutions

1. **Model framework**: The paper uses the Markov Decision Process (MDP) framework to interpret SFT and PO through two sub-processes defined at the token level: Preference Estimation and Transition Optimization. Under this modeling, the authors show that SFT is a special case of PO with weaker estimation and optimization.
2. **Intuitive Fine-Tuning (IFT)**: With a temporal residual connection, the model does not depend only on the intermediate states of the target answer when generating each token. It also forms an intuitive sense of the entire answer based on the initial instruction. This allows IFT to reach alignment performance comparable to or better than sequential SFT and PO while keeping the data and computational efficiency of SFT.
3. **Experimental verification**: The authors evaluate IFT on several benchmark tasks and in the Frozen Lake game environment. The results show that IFT performs comparably to, or better than, sequential recipes of SFT and some typical PO methods, particularly on tasks that require generation, reasoning, and fact-following abilities.

### Main contributions

1. **Theoretical explanation**: Through the MDP framework, the paper explains the similarities and differences between SFT and two basic PO methods (PPO and DPO).
2. **Method innovation**: The paper introduces Intuitive Fine-Tuning (IFT), a deeply unified version of SFT and PO. IFT uses a temporal residual connection to extract the model's generation preferences given only the initial instruction, providing efficiency similar to SFT while achieving performance close to PO.
3. **Experimental verification**: Benchmark results across several tasks show IFT's advantages in generation, reasoning, and fact-following abilities. The Frozen Lake game further validates its effectiveness.

In conclusion, by proposing IFT, this paper addresses the gap between SFT and PO in aligning with human preferences and offers a more efficient and effective approach to language model alignment.
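The claim that SFT overestimates the model's ability can be illustrated with a toy example. The sketch below is not the paper's method and does not implement IFT. It assumes a hypothetical deterministic "policy" stored as a lookup table (`TOY_POLICY`) and a made-up target sequence. It compares token accuracy when each prediction is conditioned on the target prefix (as SFT does) with accuracy when the model continues from its own previous outputs (closer to what PO evaluates).

```python
from typing import Dict, List

# Hypothetical toy policy: a deterministic next-token table keyed by the
# previous token. It makes exactly one mistake on the target path (A -> X).
TOY_POLICY: Dict[str, str] = {
    "<s>": "A", "A": "X", "B": "C", "C": "D", "D": "E",
    "X": "Y", "Y": "Z", "Z": "Z",
}

# Made-up target answer used only for this illustration.
TARGET: List[str] = ["A", "B", "C", "D", "E"]


def teacher_forced_predictions(target: List[str]) -> List[str]:
    """Predict each token from the *target* prefix, as in SFT scoring."""
    previous_tokens = ["<s>"] + target[:-1]
    return [TOY_POLICY[prev] for prev in previous_tokens]


def free_running_generation(length: int) -> List[str]:
    """Generate tokens by feeding the model its *own* previous output."""
    generated: List[str] = []
    prev = "<s>"
    for _ in range(length):
        prev = TOY_POLICY[prev]
        generated.append(prev)
    return generated


def token_accuracy(predicted: List[str], target: List[str]) -> float:
    """Fraction of positions where the predicted token equals the target."""
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)


tf_acc = token_accuracy(teacher_forced_predictions(TARGET), TARGET)
fr_acc = token_accuracy(free_running_generation(len(TARGET)), TARGET)
print(f"teacher-forced accuracy: {tf_acc:.1f}")  # 0.8
print(f"free-running accuracy:   {fr_acc:.1f}")  # 0.2
```

Under teacher forcing, the single early mistake costs one token because every later prediction is conditioned on the correct prefix. In free-running generation the same mistake moves the model off the target path and every later token is wrong, so the teacher-forced score overstates how well the model answers on its own.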

Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process

Intuitive Fine-Tuning: Towards Unifying SFT and RLHF into a Single Process

ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood

Parameter-Efficient Tuning Helps Language Model Alignment

UFT: Unifying Fine-Tuning of SFT and RLHF/DPO/UNA through a Generalized Implicit Reward Function

Learning or Self-aligning? Rethinking Instruction Fine-tuning

Token-level Direct Preference Optimization

Preference Ranking Optimization for Human Alignment

OPTune: Efficient Online Preference Tuning

Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning

Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model

A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques

AIPO: Improving Training Objective for Iterative Preference Optimization

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment

Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data

Learning Dynamics of LLM Finetuning

HyperDPO: Conditioned One-Shot Multi-Objective Fine-Tuning Framework

Preference-grounded Token-level Guidance for Language Model Fine-tuning

TSO: Self-Training with Scaled Preference Optimization