Latent Distance Guided Alignment Training for Large Language Models

Haotian Luo
2024-04-13
Abstract: Ensuring alignment with human preferences is a crucial characteristic of large language models (LLMs). Presently, the primary alignment methods, RLHF and DPO, require extensive human annotation, which is expensive despite their efficacy. The significant expenses associated with current alignment techniques motivate researchers to investigate the development of annotation-free alignment training methods. In pursuit of improved alignment without relying on external annotation, we introduce Latent Distance Guided Alignment Training (LD-Align). This approach seeks to align the model with a high-quality supervised fine-tune dataset using guidance from a latent space. The latent space is generated through sample reconstruction, akin to auto-encoding. Consequently, we utilize the distance between sample pairs in the latent space to guide DPO-based alignment training. Extensive experimentation and evaluation show the efficacy of our proposed method in achieving notable alignment.
Computation and Language
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to align large language models (LLMs) with human preferences without relying on human annotations. Current mainstream alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), although effective, require a large amount of expensive human-annotated data. To solve this problem, the author proposes a new alignment training method, Latent Distance Guided Alignment Training (LD-Align). This method guides DPO-based alignment training by quantifying the distances between sample pairs in the latent space, thereby achieving efficient alignment without additional human annotations.

### Specific Problem Description

1. **Limitations of Existing Alignment Methods**:
   - Mainstream alignment methods such as RLHF and DPO require a large amount of human-annotated data, which is both costly and time-consuming to collect.
   - These methods rely on external annotations or the support of more powerful language models, increasing the difficulty of implementation.
2. **Research Motivation**:
   - To reduce the dependence on human annotations and lower the cost of alignment training.
   - To explore a method that can self-align on high-quality supervised fine-tuning datasets to improve the performance of the model and better meet human expectations.

### Solution

The LD-Align method proposed by the author mainly includes the following steps:

1. **Constructing the Guiding Model**:
   - Use an auto-encoder structure to generate a latent space, where the encoder maps the input prompts and responses to multi-dimensional latent vectors, and the decoder attempts to reconstruct the real responses from the latent vectors.
   - Calculate the distance between the generated samples and the real samples in the latent space as the guiding signal for alignment training.
2. **Iterative Alignment Training**:
   - In each iteration, assign larger update weights to samples with larger distances in the latent space and smaller update weights to samples with smaller distances, to avoid overfitting.
   - Within the Direct Preference Optimization (DPO) framework, use the normalized distances between samples as re-weighting terms to adjust the model parameters and gradually improve the model's degree of alignment.

### Experimental Results

The experimental results show that after three iterations of LD-Align training, the model's performance on multiple benchmarks improved by an average of 6.00%, outperforming other annotation-free alignment methods such as SPIN.

### Conclusion

LD-Align achieves efficient model alignment without additional human annotations by using the sample distances in the latent space, significantly improving the model's performance on various tasks.
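
The re-weighting step described under Iterative Alignment Training can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the summary does not specify the distance metric or the normalization, so the Euclidean distance and min-max scaling used here are assumptions, and all function names (`latent_distance`, `normalize`, `dpo_loss`, `weighted_dpo_loss`) are hypothetical. Only the per-pair DPO loss follows the standard published form, with the real (dataset) response treated as preferred and the model-generated response as rejected.

```python
import math


def latent_distance(z_real, z_gen):
    """Euclidean distance between two latent vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_real, z_gen)))


def normalize(distances):
    """Min-max scale distances to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # All pairs are equally far apart: weight them equally.
        return [1.0 for _ in distances]
    return [(d - lo) / (hi - lo) for d in distances]


def dpo_loss(logp_real, logp_gen, ref_logp_real, ref_logp_gen, beta=0.1):
    """Standard DPO loss for one pair; the real response is the preferred one."""
    margin = beta * ((logp_real - ref_logp_real) - (logp_gen - ref_logp_gen))
    # -log(sigmoid(margin)) == softplus(-margin); guard against overflow in exp.
    if margin < -30.0:
        return -margin
    return math.log1p(math.exp(-margin))


def weighted_dpo_loss(pairs, weights, beta=0.1):
    """Mean of per-pair DPO losses, each scaled by its latent-distance weight."""
    total = sum(w * dpo_loss(*p, beta=beta) for p, w in zip(pairs, weights))
    return total / len(pairs)


# Toy batch: latent vectors for (real, generated) responses to three prompts.
latents = [([0.0, 0.0], [3.0, 4.0]), ([1.0, 1.0], [1.0, 2.0]), ([0.0, 0.0], [0.0, 3.0])]
weights = normalize([latent_distance(r, g) for r, g in latents])

# Sequence log-probabilities: (policy real, policy gen, reference real, reference gen).
pairs = [(-10.0, -12.0, -11.0, -11.0), (-8.0, -8.0, -8.0, -8.0), (-9.0, -7.0, -9.0, -8.0)]
loss = weighted_dpo_loss(pairs, weights)
```

With min-max scaling, the pair whose generated response is already closest to the real one receives weight 0 and contributes nothing to the update, which is one way to realize the "smaller weight for smaller distance" idea; a softer scheme such as dividing by the batch maximum would keep every pair active.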