Provably Robust DPO: Aligning Language Models with Noisy Feedback

Sayak Ray Chowdhury, Anush Kini, Nagarajan Natarajan
2024-04-12
Abstract: Learning from preference-based feedback has recently gained traction as a promising approach to align language models with human interests. While these aligned generative models have demonstrated impressive capabilities across various tasks, their dependence on high-quality human preference data poses a bottleneck in practical applications. Specifically, noisy (incorrect and ambiguous) preference pairs in the dataset might restrict the language models from capturing human intent accurately. While practitioners have recently proposed heuristics to mitigate the effect of noisy preferences, a complete theoretical understanding of their workings remains elusive.
Machine Learning, Computation and Language
What problem does this paper attempt to address?