Data-efficient Fine-tuning for LLM-based Recommendation

Xinyu Lin, Wenjie Wang, Yongqi Li, Shuo Yang, Fuli Feng, Yinwei Wei, Tat-Seng Chua
2024-06-04
Abstract: Leveraging Large Language Models (LLMs) for recommendation has recently garnered considerable attention, where fine-tuning plays a key role in LLMs' adaptation. However, the cost of fine-tuning LLMs on rapidly expanding recommendation data limits their practical application. To address this challenge, few-shot fine-tuning offers a promising approach to quickly adapt LLMs to new recommendation data. We propose the task of data pruning for efficient LLM-based recommendation, aimed at identifying representative samples tailored for LLMs' few-shot fine-tuning. While coreset selection is closely related to the proposed task, existing coreset selection methods often rely on suboptimal heuristic metrics or entail costly optimization on large-scale recommendation data. To tackle these issues, we introduce two objectives for the data pruning task in the context of LLM-based recommendation: 1) high accuracy aims to identify the influential samples that can lead to high overall performance; and 2) high efficiency emphasizes keeping the cost of the data pruning process low. To pursue the two objectives, we propose a novel data pruning method based on two scores, i.e., influence score and effort score, to efficiently identify the influential samples. In particular, the influence score is introduced to accurately estimate the influence of sample removal on the overall performance. To keep the cost of the data pruning process low, we use a small-sized surrogate model in place of LLMs to obtain the influence score. Considering the potential gap between the surrogate model and LLMs, we further propose an effort score to prioritize some hard samples specifically for LLMs. Empirical results on three real-world datasets validate the effectiveness of our proposed method. In particular, the proposed method uses only 2% of the samples to surpass full-data fine-tuning, reducing time costs by 97%.
Information Retrieval
What problem does this paper attempt to address?
The paper addresses the high compute and time cost of fine-tuning large language models (LLMs) on large-scale recommendation data. Specifically:

1. **Efficient Fine-Tuning**: The data LLMs see during pre-training differs from the data in actual recommendation tasks, and recommendation data is continuously updated, so LLMs must be fine-tuned frequently. Doing so consumes substantial computational resources and time, which limits the practicality of LLMs in real-world applications.
2. **Sample Selection**: The authors propose the task of "data pruning", which aims to identify representative samples from large-scale recommendation data for few-shot fine-tuning of LLMs. Fine-tuning on a small number of representative samples can significantly reduce time and computational costs.
3. **Limitations of Coreset Selection**: Existing coreset selection methods either rely on heuristic metrics that fail to effectively evaluate the impact of samples on empirical risk, or require optimization that is difficult to apply to large-scale datasets. Moreover, these methods rely on models trained on the entire dataset to select the coreset, which is infeasible in recommendation because of the high training costs of LLMs.

To address these issues, the paper proposes a data pruning method called DEALRec, which combines an influence score and an effort score to efficiently identify the most influential samples for fine-tuning LLMs, improving fine-tuning efficiency while maintaining good recommendation performance.
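
The idea of scoring samples on a small surrogate model and then keeping only a small budget of them can be illustrated with a toy sketch. It is not the DEALRec implementation: the ridge-regression surrogate, the brute-force leave-one-out retraining, the min-max normalization, the weight `lam`, and the plain top-k selection are all assumptions made for illustration, and the effort score is treated as a precomputed input because the text above does not define how it is calculated.

```python
import numpy as np


def loo_influence(X, y, ridge=1e-3):
    """Toy influence score: change in overall loss when one sample is removed.

    Assumption: a tiny ridge-regression surrogate that is retrained once per
    sample. This brute-force loop only illustrates the quantity being
    estimated; it is not an efficient or paper-specific estimator.
    """
    n, d = X.shape

    def fit(Xs, ys):
        # Closed-form ridge regression on the given subset.
        return np.linalg.solve(Xs.T @ Xs + ridge * np.eye(d), Xs.T @ ys)

    def overall_loss(w):
        # Mean squared error measured on the full dataset.
        return float(np.mean((X @ w - y) ** 2))

    base = overall_loss(fit(X, y))
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        scores[i] = overall_loss(fit(X[keep], y[keep])) - base
    return scores


def min_max(v):
    """Rescale a score vector to [0, 1] so the two scores are comparable."""
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    return np.zeros_like(v) if span == 0 else (v - v.min()) / span


def select_samples(influence, effort, budget=0.02, lam=0.5):
    """Return indices of the top `budget` fraction by combined score.

    `lam` (hypothetical) weights the effort score against the influence score.
    """
    combined = min_max(influence) + lam * min_max(effort)
    k = max(1, int(round(budget * len(combined))))
    return np.argsort(-combined, kind="stable")[:k]
```

For example, `select_samples(loo_influence(X, y), effort, budget=0.02)` would keep roughly 2% of the rows of `X`, where `effort` is whatever per-sample difficulty measure is available for the target LLM.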