One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models

Yutao Zhu, Zhaoheng Huang, Zhicheng Dou, Ji-Rong Wen
2024-06-08
Abstract: Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved information or directly fine-tune LLMs to adapt to RAG scenarios. Although fine-tuning can yield better performance, it often compromises the LLMs' general generation capabilities by modifying their parameters. This limitation poses challenges in practical applications, especially when LLMs are already deployed, as parameter adjustments may affect their original functionality. To address this, we propose a novel method that involves learning scalable and pluggable virtual tokens for RAG. By maintaining the LLMs' original parameters and fine-tuning only the embeddings of these pluggable tokens, our approach not only enhances LLMs' performance but also preserves their general generation capabilities. Furthermore, we design several training strategies to improve the scalability, flexibility, and generalizability of our method. Comprehensive experiments across nine question-answering tasks demonstrate the superiority of our approach.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the tendency of large language models (LLMs) to generate hallucinated, outdated, or inaccurate content, especially in scenarios that require long-tail knowledge. To tackle this, it proposes a method called SPRING, which introduces trainable virtual tokens to improve LLM performance in retrieval-augmented generation (RAG) scenarios while preserving the models' general generation capabilities. SPRING has the following features:

1. **Lightweight and Efficient**: Only the embeddings of the added virtual tokens are trained; the LLM's own parameters are left unchanged.
2. **Scalability**: The training method allows the number of virtual tokens to be adjusted to the needs of the inference scenario, and even a single token improves performance significantly.
3. **Plug-and-Play**: The virtual tokens are added when retrieval is triggered and omitted in non-RAG scenarios, so the LLM's original generation capabilities are preserved.
4. **Strong Generalization**: A robust training strategy lets SPRING adapt to different retrievers and to varying numbers of retrieved results, without retraining each time the retrieval system is updated.

Experimental results show that SPRING improves LLM performance on RAG tasks while retaining general generation capabilities on non-RAG tasks. It also outperforms the compared methods across a range of tasks and remains robust when the retriever changes.
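The plug-and-play and scalability features can be illustrated with a minimal sketch of how an input sequence might be assembled. This is not the authors' implementation: the function names `make_virtual_tokens` and `build_input` are hypothetical, the embeddings are random stand-ins for trained ones, and the layout (retrieved passages, then the first `k` virtual tokens, then the question) is an assumption made for illustration.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]


def make_virtual_tokens(max_tokens: int, dim: int, seed: int = 0) -> List[Vector]:
    """Stand-in for learned virtual-token embeddings (random here, trained in practice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(max_tokens)]


def build_input(
    question: Sequence[Vector],
    retrieved: Optional[Sequence[Vector]],
    virtual_tokens: Sequence[Vector],
    k: int,
) -> List[Vector]:
    """Assemble the embedding sequence that would be fed to a frozen LLM."""
    if not retrieved:
        # Non-RAG request: omit the virtual tokens so the model sees its usual input.
        return list(question)
    if not 0 <= k <= len(virtual_tokens):
        raise ValueError("k must be between 0 and the number of trained tokens")
    # Assumed layout: retrieved passages, then k virtual tokens, then the question.
    return list(retrieved) + list(virtual_tokens[:k]) + list(question)


# Toy embeddings: 5 retrieved-passage positions and 2 question positions.
tokens = make_virtual_tokens(max_tokens=4, dim=3)
passages = [[0.0, 1.0, 0.0] for _ in range(5)]
question = [[1.0, 0.0, 0.0] for _ in range(2)]

with_rag = build_input(question, passages, tokens, k=1)   # 5 + 1 + 2 = 8 positions
without_rag = build_input(question, None, tokens, k=1)    # question only, 2 positions
print(len(with_rag), len(without_rag))
```

In this sketch only the entries of `tokens` would receive gradient updates during training, which is what keeps the approach lightweight. Choosing `k` at inference time trades a slightly longer input for more learned capacity.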