Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim
2024-06-03
Abstract: Moderate-sized large language models (LLMs) -- those with 7B or 13B parameters -- exhibit promising machine translation (MT) performance. However, even the top-performing 13B LLM-based translation models, like ALMA, do not match the performance of state-of-the-art conventional encoder-decoder translation models or larger-scale LLMs such as GPT-4. In this study, we bridge this performance gap. We first assess the shortcomings of supervised fine-tuning (SFT) for LLMs in the MT task, emphasizing the quality issues present in the reference data, despite being human-generated. Then, in contrast to SFT, which mimics reference translations, we introduce Contrastive Preference Optimization (CPO), a novel approach that trains models to avoid generating adequate but not perfect translations. Applying CPO to ALMA models with only 22K parallel sentences and 12M parameters yields significant improvements. The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4 on WMT'21, WMT'22 and WMT'23 test datasets.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to improve the machine translation performance of moderate-sized large language models (LLMs), those with 7B or 13B parameters. Although these models show promise, they still trail state-of-the-art conventional encoder-decoder translation models and larger LLMs such as GPT-4. The authors first analyze the limitations of supervised fine-tuning (SFT), pointing out that reference data has quality issues even though it is human-generated. They identify two fundamental shortcomings of SFT: first, its objective is to minimize the gap between predicted outputs and the gold-standard reference, so the model's quality is bounded by the quality of that reference; second, it has no mechanism for teaching the model to reject mistakes in translation. To address both, the paper proposes Contrastive Preference Optimization (CPO). Unlike SFT, CPO does not ask the model to mimic the reference translation; it trains the model to avoid generating translations that are adequate but not perfect, so that the model learns to prefer higher-quality output over near-perfect but flawed output. The authors apply CPO to the ALMA models using only 22K parallel sentences and 12M trainable parameters (about 0.1% of the model). The resulting model, ALMA-R, matches or exceeds the performance of GPT-4 and of the WMT competition winners on the WMT'21, WMT'22, and WMT'23 test datasets. According to the experiments, CPO training is also efficient and fast in addition to being effective at improving translation quality.
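The text above gives no formula for CPO, so the following is only a minimal sketch of how a contrastive preference objective of this kind can be computed for one training example. It assumes a common formulation: a log-sigmoid term that rewards a higher log-probability for the preferred translation than for the dispreferred one, plus a negative log-likelihood term on the preferred translation. The function names, the `beta` value, and the equal weighting of the two terms are assumptions made for illustration, not details confirmed by the source.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def cpo_style_loss(logp_preferred: float,
                   logp_dispreferred: float,
                   beta: float = 0.1) -> float:
    """Hypothetical per-example loss (illustrative, not the authors' code).

    logp_preferred / logp_dispreferred are the model's summed token
    log-probabilities for the better and the worse translation of the
    same source sentence. beta scales the preference margin (assumed value).
    """
    # Preference term: small when the preferred translation is more likely.
    prefer_term = -log_sigmoid(beta * (logp_preferred - logp_dispreferred))
    # Likelihood term: keeps the model close to the preferred translation.
    nll_term = -logp_preferred
    return prefer_term + nll_term


# The model already ranks the pair correctly: lower loss.
good_ranking = cpo_style_loss(logp_preferred=-10.0, logp_dispreferred=-20.0)
# The model prefers the flawed translation: higher loss.
bad_ranking = cpo_style_loss(logp_preferred=-10.0, logp_dispreferred=-5.0)
print(good_ranking < bad_ranking)
```

With the likelihood of the preferred translation held fixed, the loss rises as the model assigns more probability to the flawed translation. That is the sense in which training of this kind pushes the model away from adequate but imperfect output instead of only pulling it toward a reference.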