ATFLRec: A Multimodal Recommender System with Audio-Text Fusion and Low-Rank Adaptation via Instruction-Tuned Large Language Model

Zezheng Qin
2024-09-13
Abstract: Recommender Systems (RS) play a pivotal role in boosting user satisfaction by providing personalized product suggestions in domains such as e-commerce and entertainment. This study examines the integration of multimodal data, specifically text and audio, into large language models (LLMs) with the aim of enhancing recommendation performance. Traditional text and audio recommenders encounter limitations such as the cold-start problem, and recent advancements in LLMs, while promising, are computationally expensive. To address these issues, Low-Rank Adaptation (LoRA) is introduced, which enhances efficiency without compromising performance. The ATFLRec framework is proposed to integrate audio and text modalities into a multimodal recommendation system, utilizing various LoRA configurations and modality fusion techniques. Results indicate that ATFLRec outperforms baseline models, including traditional and graph neural network-based approaches, achieving higher AUC scores. Furthermore, separate fine-tuning of audio and text data with distinct LoRA modules yields optimal performance, with different pooling methods and Mel filter bank numbers significantly impacting performance. This research offers valuable insights into optimizing multimodal recommender systems and advancing the integration of diverse data modalities in LLMs.
Information Retrieval, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to improve the performance of recommendation systems, focusing on the cold-start problem and on multimodal data fusion. Specifically:

1. **Cold-start problem**: Traditional text and audio recommendation systems struggle to provide accurate personalized recommendations for new users or new items because they lack sufficient historical data. Large language models (LLMs) show potential here, but their computational costs are high, and adjusting the parameters of the entire model is computationally impractical and expensive.
2. **Multimodal data fusion**: Most current recommendation systems rely on a single modality (such as text or audio) and ignore the combined use of multiple modalities (such as text, audio, and images). Multimodal recommendation systems can provide more comprehensive user and content information, but effectively integrating information from these different modalities remains a challenge.

To solve these problems, the paper proposes the ATFLRec framework, which aims to improve recommendation performance in the following ways:

- **Low-rank adaptation (LoRA)**: Improves efficiency by modifying only specific parameters, without affecting the running time of the recommendation system. LoRA enables efficient model fine-tuning even in low-GPU-memory settings.
- **Multimodal fusion**: Integrates audio and text modality data into large language models, using different LoRA configurations and modality-fusion techniques to enhance recommendation performance.

The main contributions of the paper include:

1. Proposing a multimodal recommendation system that integrates audio-modality content into large language models.
2. Exploring the impact of different LoRA modules on large language models and providing empirical insights into multimodal model fine-tuning.
3. Studying the impact of different audio stacking pooling methods, multimodal data-fusion pooling methods, and the number of filters on the performance of recommendation systems.

Through these improvements, ATFLRec can also significantly outperform traditional deep-learning recommendation methods in few-shot learning settings and achieve better performance on the AUC metric.
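
To make the low-rank adaptation idea mentioned above more concrete, the following is a minimal pure-Python sketch of a LoRA-style linear layer. It is not the paper's implementation: the class name `LoRALinear`, the helper `matmul`, the hyperparameters `r` and `alpha`, and the layer sizes are all illustrative assumptions. The sketch only shows the general mechanism, in which a frozen weight matrix `W` is combined with a trainable low-rank product `B @ A`, so that far fewer parameters need to be updated than in full fine-tuning.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Illustrative LoRA-style layer (hypothetical, not the paper's code).

    Effective weight = W + (alpha / r) * B @ A, where W is frozen and
    only A (r x d_in) and B (d_out x r) would be trained.
    """

    def __init__(self, weight, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight, d_out x d_in
        self.scale = alpha / r        # assumed scaling convention
        # A gets small random values and B starts at zero,
        # so the low-rank update is exactly zero before any training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]

    def effective_weight(self):
        delta = matmul(self.B, self.A)  # d_out x d_in, rank at most r
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to an input vector x (list of floats)."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self):
        """Number of parameters in A and B (the only ones that would be updated)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy sizes chosen only for illustration.
d_out, d_in, r = 8, 16, 2
rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
layer = LoRALinear(W, r=r, alpha=4)

x = [1.0] * d_in
base_output = [sum(w * xi for w, xi in zip(row, x)) for row in W]
adapted_output = layer.forward(x)

full_params = d_out * d_in
print("full fine-tuning parameters:", full_params)
print("LoRA trainable parameters:", layer.trainable_params())
print("output unchanged at initialization:", adapted_output == base_output)
```

In this toy setting the low-rank factors hold `r * d_in + d_out * r` values rather than `d_out * d_in`, and the gap widens as the layer dimensions grow relative to `r`. Using separate adapters per modality, as the summary describes, would under the same assumptions amount to keeping one such `(A, B)` pair for audio inputs and another for text inputs.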