Abstract: As large language models (LLMs) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. In this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. We first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem, and we develop a method--causal preference optimization (CPO)--that solves an unbiased surrogate objective for the problem. We further extend CPO with doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions.

What problem does this paper attempt to address?

The paper addresses how large language models (LLMs) can be optimized so that the text they generate aligns with human preferences. Although current LLMs perform well on many natural language processing tasks, the varying quality of their training data means they can learn undesirable attributes, such as toxic language. Further fine-tuning is therefore needed to improve their factual accuracy and align them more closely with social values.

The paper explores learning human preferences from direct outcome datasets, in which each sample consists of a text and a numerical outcome measuring the reader's response to it. Unlike traditional paired completion data, each sample in a direct outcome dataset directly measures how strongly readers prefer the text. To ensure that the model learns the causal relationship between a text and its outcome, the authors propose viewing language model optimization as a causal inference problem, and they formalize this causal language optimization problem.

They develop causal preference optimization (CPO), which solves an unbiased surrogate objective for this problem, and extend it with doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Experiments show that (DR-)CPO can optimize state-of-the-art LLMs for human preferences on direct outcome data, and that DR-CPO remains robust under difficult confounding conditions.
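The summary above does not give the CPO or DR-CPO objectives, so the following is only a minimal sketch of the two generic ideas it names: an importance-weighted estimate of the expected outcome under a new text distribution, and a doubly robust variant that adds an outcome model. It is not the paper's implementation. The function names, the two-text toy "vocabulary", the probabilities, and the outcome values are all hypothetical, and the sketch assumes the distribution that produced the logged texts is known.

```python
def ipw_value(samples, target_prob, logging_prob):
    """Importance-weighted estimate of the mean outcome under target_prob.

    samples: list of (text, outcome) pairs drawn under logging_prob.
    target_prob / logging_prob: dicts mapping each text to its probability.
    """
    total = 0.0
    for text, outcome in samples:
        weight = target_prob[text] / logging_prob[text]
        total += weight * outcome
    return total / len(samples)


def dr_value(samples, target_prob, logging_prob, outcome_model):
    """Doubly robust estimate: a model-based term plus a weighted residual.

    outcome_model: dict mapping each text to a predicted outcome
    (a hypothetical stand-in for a learned outcome predictor).
    """
    # Direct term: expected predicted outcome under the target distribution.
    direct = sum(target_prob[text] * outcome_model[text] for text in target_prob)
    # Correction term: importance-weighted residuals of the outcome model.
    correction = 0.0
    for text, outcome in samples:
        weight = target_prob[text] / logging_prob[text]
        correction += weight * (outcome - outcome_model[text])
    return direct + correction / len(samples)


# Hypothetical toy data: two candidate texts with logged reader outcomes.
logging_prob = {"a": 0.5, "b": 0.5}   # distribution that produced the data
target_prob = {"a": 0.2, "b": 0.8}    # distribution we want to evaluate
samples = [("a", 1.0), ("a", 1.0), ("b", 3.0), ("b", 3.0)]

ipw_estimate = ipw_value(samples, target_prob, logging_prob)
dr_estimate = dr_value(samples, target_prob, logging_prob, {"a": 1.0, "b": 3.0})
```

In this sketch, an exact outcome model makes every residual zero, so the estimate comes entirely from the model-based term. An outcome model that predicts zero everywhere reduces the doubly robust estimate to the importance-weighted one. In between, a reasonably accurate outcome model shrinks the residuals being reweighted, which is the usual source of variance reduction in doubly robust estimators.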

Optimizing Language Models for Human Preferences is a Causal Inference Problem

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Aligning Large Language Models with Counterfactual DPO

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

Soft Preference Optimization: Aligning Language Models to Expert Distributions

New Desiderata for Direct Preference Optimization

Statistical Rejection Sampling Improves Preference Optimization

Confronting Reward Model Overoptimization with Constrained RLHF

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments

SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

Causal Inference for Human-Language Model Collaboration

Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment

Uncertainty-Penalized Direct Preference Optimization

Towards Efficient Exact Optimization of Language Model Alignment

Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads