Xiaozhong Lyu, Stefan Grafberger, Samantha Biegel, Shaopeng Wei, Meng Cao, Sebastian Schelter, Ce Zhang
Abstract: Retrieval augmentation enables large language models to take advantage of external knowledge, for example on tasks like question answering and data imputation. However, the performance of such retrieval-augmented models is limited by the data quality of their underlying retrieval corpus. In this paper, we propose an algorithm based on multilinear extension for evaluating the data importance of retrieved data points. There are exponentially many terms in the multilinear extension, and one key contribution of this paper is a polynomial time algorithm that computes exactly, given a retrieval-augmented model with an additive utility function and a validation set, the data importance of data points in the retrieval corpus using the multilinear extension of the model's utility function. We further propose an even more efficient \((\epsilon, \delta)\)-approximation algorithm. Our experimental results illustrate that we can enhance the performance of large language models by only pruning or reweighting the retrieval corpus, without requiring further training. For some tasks, this even allows a small model (e.g., GPT-JT), augmented with a search engine API, to outperform GPT-3.5 (without retrieval augmentation). Moreover, we show that weights based on multilinear extension can be computed efficiently in practice (e.g., in less than ten minutes for a corpus with 100 million elements).
What problem does this paper attempt to address?
This paper aims to improve the performance of retrieval-augmented large language models, whose answers depend on an external knowledge base. Specifically, it asks how to measure the importance of each retrieved data point and how to use those measurements to improve the model on tasks such as question answering and data imputation.
### Background and Motivation of the Paper
Large language models (LLMs) have made significant progress on natural language processing tasks, but they have two main drawbacks:
1. **Poor performance on long-tail entities**: LLMs perform poorly on entities that were not seen during training or that cannot be memorized because of limited network capacity.
2. **High training costs**: As the number of model parameters increases, the costs of training and fine-tuning also rise sharply.
To overcome these problems, researchers have proposed Retrieval-Augmented Generation (RAG) models. A RAG model combines a retriever with a generator, so it can draw on information in an external knowledge base at inference time. However, the performance of a retrieval-augmented model depends heavily on the quality of the retrieved data points. If the retrieved data contains errors or noise, the model's performance degrades.
### Main Contributions of the Paper
1. **Multilinear Extension Algorithm**: The paper proposes an algorithm based on multilinear extension for evaluating the importance of retrieved data points. The multilinear extension function can be expressed as:
\[
\tilde{U}(w_1, \ldots, w_M) = \sum_{S \subseteq D_{\text{ret}}} U(S) \prod_{d_i \in S} w_i \prod_{d_i \notin S} (1 - w_i)
\]
where \( U(S) \) is the model's performance on the validation set when only the subset \( S \) of the retrieval corpus \( D_{\text{ret}} \) is available, and \( w_i \) is the weight of the \( i \)-th data point.
2. **Efficient Computation Method**: Although the multilinear extension has exponentially many terms, the paper gives a polynomial-time algorithm that computes the data importance exactly, given a retrieval-augmented model with an additive utility function and a validation set. It also introduces an even more efficient \((\epsilon, \delta)\)-approximation algorithm.
3. **Experimental Verification**: Experimental results show that pruning or reweighting the retrieval corpus alone, without further training, can improve the performance of large language models. On some tasks, a small retrieval-augmented model (such as GPT-JT) can even outperform a large model without retrieval augmentation (such as GPT-3.5).
4. **Practical Application**: The paper shows that the weights based on multilinear extension can be quickly calculated in practice. Even for a large corpus containing 100 million data points, the calculation can be completed in less than ten minutes.
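To make the definition of \( \tilde{U} \) above concrete, here is a minimal Python sketch that evaluates the multilinear extension by brute force on a tiny corpus. It is not the paper's polynomial-time algorithm: it enumerates all \( 2^M \) subsets, which is exactly the cost that algorithm avoids. The toy utility `toy_utility`, the `IS_CORRECT` labels, and all function names are assumptions made for illustration. The sketch also uses the fact that \( \tilde{U} \) is linear in each \( w_i \), so its partial derivative with respect to \( w_i \) equals \( \tilde{U}(w_i = 1) - \tilde{U}(w_i = 0) \), and it includes a plain Monte Carlo estimate with a Hoeffding sample bound (valid for utilities in \([0, 1]\)) as a generic stand-in for an approximation scheme, not the paper's \((\epsilon, \delta)\)-algorithm.

```python
import itertools
import math
import random

# Hypothetical toy corpus: three retrieved points, the third one is noisy.
IS_CORRECT = [1, 1, 0]


def toy_utility(subset):
    """Toy U(S): fraction of correct points in S (0 for the empty set)."""
    if not subset:
        return 0.0
    return sum(IS_CORRECT[i] for i in subset) / len(subset)


def multilinear_extension(utility, weights):
    """Exact value: sum over all subsets S of U(S) * Pr[S] (exponential cost)."""
    total = 0.0
    for mask in itertools.product([0, 1], repeat=len(weights)):
        prob = 1.0
        for included, w in zip(mask, weights):
            prob *= w if included else (1.0 - w)
        subset = frozenset(i for i, included in enumerate(mask) if included)
        total += utility(subset) * prob
    return total


def importance(utility, weights, i):
    """Partial derivative w.r.t. w_i, computed as U~(w_i=1) - U~(w_i=0)."""
    with_i = list(weights)
    with_i[i] = 1.0
    without_i = list(weights)
    without_i[i] = 0.0
    return (multilinear_extension(utility, with_i)
            - multilinear_extension(utility, without_i))


def hoeffding_samples(eps, delta):
    """Samples needed so a mean of [0, 1] values is within eps w.p. >= 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def sampled_extension(utility, weights, num_samples, seed=0):
    """Monte Carlo estimate: include point i independently with probability w_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        subset = frozenset(i for i, w in enumerate(weights) if rng.random() < w)
        total += utility(subset)
    return total / num_samples
```

With all weights set to 0.5, the exact value on this toy corpus is 7/12, the noisy third point gets importance -1/3, and each clean point gets a positive importance. A negative value is the kind of signal that would suggest down-weighting or pruning a point.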
### Experimental Results
- **Question Answering Tasks**: On the WikiFact dataset, the average accuracy of the retrieval-augmented small model GPT-JT rises from 21.4% to 33.3%, approaching the 33.9% of the large model GPT-3.5. Reweighting and pruning the retrieval corpus raises the accuracy of GPT-JT further, to 39.2%.
- **Data Imputation Tasks**: On the buy and restaurant datasets, the retrieval-augmented small model GPT-JT performs better than the large model GPT-3.5. In particular, on the buy dataset, the accuracy of GPT-JT rises from 78.9% to 81.5%.
### Noise Mitigation
The paper also shows how multilinear extension weights can mitigate the impact of noise in the retrieval corpus. After noise is injected, the model's accuracy drops from 33.3% to 27.0%. Reweighting and pruning the noisy data sources restores it to 33.0% and 33.5% respectively, the latter slightly exceeding the 33.3% obtained with the clean corpus.
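The difference between pruning and reweighting can be illustrated with a short Python sketch. It assumes that a weight between 0 and 1 has already been learned for each retrieved item, that pruning drops items below a threshold of 0.5, and that reweighting means a weighted vote over candidate answers. The threshold, the voting rule, the source names, and the function names are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def prune(retrieved, weights, threshold=0.5):
    """Keep only retrieved items whose weight reaches the (assumed) threshold."""
    return [item for item, w in zip(retrieved, weights) if w >= threshold]


def weighted_vote(retrieved, weights):
    """Return the answer whose supporting items have the largest total weight."""
    scores = defaultdict(float)
    for (_source, answer), w in zip(retrieved, weights):
        scores[answer] += w
    return max(scores, key=scores.get)


# Hypothetical retrieval result: one reliable source and two noisy ones.
retrieved = [("wiki", "Paris"), ("spam1", "Lyon"), ("spam2", "Lyon")]
learned_weights = [0.9, 0.2, 0.3]

unweighted_answer = weighted_vote(retrieved, [1.0, 1.0, 1.0])   # plain majority
reweighted_answer = weighted_vote(retrieved, learned_weights)   # weights applied
pruned = prune(retrieved, learned_weights)                      # noisy items removed
```

In this toy case the plain majority vote follows the two noisy sources, while both the reweighted vote and the pruned corpus recover the answer from the reliable source.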
### Summary
This paper proposes an algorithm based on multilinear extension for evaluating the importance of retrieved data points, and shows that pruning or reweighting the retrieval corpus according to these importance values improves retrieval-augmented models without further training. The approach improves performance on question answering and data imputation tasks, and it also mitigates the effect of noise in the retrieval corpus.