Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Zongmeng Zhang, Yufeng Shi, Jinhua Zhu, Wengang Zhou, Xiang Qi, Peng Zhang, Houqiang Li
2024-10-22
Abstract: Trustworthiness is an essential prerequisite for the real-world application of large language models. In this paper, we focus on the trustworthiness of language models with respect to retrieval augmentation. Despite being supported with external evidence, retrieval-augmented generation still suffers from hallucinations, one primary cause of which is the conflict between contextual and parametric knowledge. We deem that retrieval-augmented language models have the inherent capabilities of supplying response according to both contextual and parametric knowledge. Inspired by aligning language models with human preference, we take the first step towards aligning retrieval-augmented language models to a status where it responds relying merely on the external evidence and disregards the interference of parametric knowledge. Specifically, we propose a reinforcement learning based algorithm Trustworthy-Alignment, theoretically and experimentally demonstrating large language models' capability of reaching a trustworthy status without explicit supervision on how to respond. Our work highlights the potential of large language models on exploring its intrinsic abilities by its own and expands the application scenarios of alignment from fulfilling human preference to creating trustworthy agents.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the trustworthiness of large language models (LLMs) that use Retrieval-Augmented Generation (RAG). Although RAG strengthens generation by supplying external evidence, these models still suffer from "hallucinations", that is, generated content that does not match the input or the facts. One primary cause is the conflict between contextual knowledge (the retrieved evidence) and parametric knowledge (what the model stores in its weights). The paper proposes a reinforcement-learning-based method, Trustworthy-Alignment, which aims to make retrieval-augmented language models respond solely on the basis of external evidence, reducing or eliminating the interference of parametric knowledge and thereby increasing the model's trustworthiness. Specifically, the paper focuses on three points:

1. **Hypothesis verification**: Verify whether retrieval-augmented LLMs are able to generate responses from both contextual knowledge and parametric knowledge.
2. **Algorithm design**: Design a reinforcement learning algorithm that improves trustworthiness by rewarding the model for relying on contextual knowledge and penalizing it for relying on parametric knowledge.
3. **Performance evaluation**: Evaluate how well the proposed algorithm improves trustworthiness and analyze its possible side effects.

Through these studies, the paper explores how LLMs can draw on their inherent capabilities, and it expands the application scenarios of alignment from satisfying human preferences to creating trustworthy agents.
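The reward-and-penalty idea in the algorithm design point can be illustrated with a small sketch. This is not the paper's reward function, which is not given here. It is a hypothetical toy that assumes each training example comes with a short answer supported by the retrieved context (`context_answer`) and a short answer the model would give from memory (`parametric_answer`), and it scores a response by crude substring matching. The names `normalize` and `trust_reward` and the reward values of +1, -1, and 0 are assumptions made for illustration only.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def trust_reward(response: str, context_answer: str, parametric_answer: str) -> float:
    """Hypothetical scalar reward for a single response.

    +1.0 if the response contains only the context-supported answer,
    -1.0 if it contains only the parametric (memorized) answer,
     0.0 if it contains both or neither.
    Matching is a simple substring test on normalized text (an assumption).
    """
    resp = normalize(response)
    ctx = normalize(context_answer)
    par = normalize(parametric_answer)

    uses_context = bool(ctx) and ctx in resp
    uses_parametric = bool(par) and par in resp

    if uses_context and not uses_parametric:
        return 1.0
    if uses_parametric and not uses_context:
        return -1.0
    return 0.0


if __name__ == "__main__":
    # Counterfactual context: the retrieved passage claims "Rome",
    # while the model's memorized answer would be "Paris".
    for reply in ["It is in Rome.", "It is in Paris.", "I am not sure."]:
        print(reply, trust_reward(reply, "Rome", "Paris"))
```

In a reinforcement learning loop, a scalar of this kind would be fed to a policy-gradient update so that responses grounded in the retrieved evidence become more likely. A substring test is deliberately simplistic and would misfire on paraphrases or on answers that appear inside longer words, so a practical reward would need a more robust matcher.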