Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations

Luyi Ma, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sushant Kumar, Kannan Achan
2024-10-16
Abstract: Integrating diverse data modalities is crucial for enhancing the performance of personalized recommendation systems. Traditional models, which often rely on a single data source, lack the depth needed to accurately capture the multifaceted nature of item features and user behaviors. This paper introduces a novel framework for multi-behavior recommendations that fuses three modalities (visual, textual, and graph data) through alignment with large language models (LLMs). Visual information captures contextual and aesthetic item characteristics; textual data provides detailed insights into user interests and item features; and graph data elucidates relationships within the item-behavior heterogeneous graphs. Our proposed model, called Triple Modality Fusion (TMF), uses LLMs to align and integrate these three modalities, achieving a comprehensive representation of user behaviors. The LLM models the user's interactions, including behaviors and item features, in natural language. Initially, the LLM is warmed up using only natural language-based prompts. We then devise a modality fusion module based on cross-attention and self-attention mechanisms to integrate the modalities produced by other models into the same embedding space and incorporate them into the LLM. Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy. Further ablation studies validate the effectiveness of our model design and the benefits of TMF.
Information Retrieval, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses a limitation of personalized recommendation systems: models built on a single data source cannot comprehensively capture the multi-dimensional nature of user behavior and product features. To overcome this limitation, the paper proposes a new framework, Triple Modality Fusion (TMF), which aligns visual, textual, and graph data with large language models (LLMs) to perform multi-behavior recommendation. Specifically, the paper makes the following contributions:

1. **Multimodal Data Fusion**: Visual information captures the contextual and aesthetic features of products, textual data captures user interests and detailed product features, and graph data captures the relationships within the item-behavior heterogeneous graph.
2. **LLM-based Recommendation Model**: User behavior sequences are converted into natural language input prompts so that the LLM can model and predict user behavior.
3. **Modality Fusion Module**: A module based on cross-attention and self-attention mechanisms projects data from the different modalities into the same embedding space so that they can be incorporated into the LLM.
4. **Experimental Validation**: Extensive experiments on three benchmark datasets validate the effectiveness of the TMF model in improving recommendation accuracy, and the model has been deployed in real production environments.

In summary, the Triple Modality Fusion framework improves the accuracy of multi-behavior recommendation in the reported experiments and offers a way to combine visual, textual, and graph signals within an LLM-based recommender.
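
To make the modality fusion idea more concrete, here is a minimal NumPy sketch of how three per-item embeddings could be projected into one shared space and then mixed with self-attention and cross-attention. It is an illustration under stated assumptions, not the paper's implementation: the function names (`fuse_item`, `attention`), the embedding dimensions, the random projection weights, the absence of learned query/key/value matrices, and the choice of the text token as the cross-attention query are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention over 2-D arrays of shape (n, d)."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ value


def fuse_item(image_vec, text_vec, graph_vec, weights):
    """Fuse one item's three modality vectors into a single shared-space vector.

    Assumption: each modality gets its own linear projection into the shared
    dimension; a trained model would learn these weights and would normally
    also learn separate query/key/value projections.
    """
    # 1. Project every modality into the same embedding space -> (3, d).
    tokens = np.stack([
        image_vec @ weights["image"],
        text_vec @ weights["text"],
        graph_vec @ weights["graph"],
    ])
    # 2. Self-attention: the three modality tokens exchange information.
    mixed = tokens + attention(tokens, tokens, tokens)
    # 3. Cross-attention: the text token (row 1) queries all modality tokens.
    #    Using text as the query is an assumption made for this sketch.
    query = mixed[1:2]
    fused = query + attention(query, mixed, mixed)
    return fused[0]


# Hypothetical dimensions for the outputs of separate image, text, and graph encoders.
rng = np.random.default_rng(0)
dims = {"image": 512, "text": 384, "graph": 64}
shared_dim = 32
weights = {
    name: rng.normal(scale=dim ** -0.5, size=(dim, shared_dim))
    for name, dim in dims.items()
}
item = {name: rng.normal(size=dim) for name, dim in dims.items()}

fused = fuse_item(item["image"], item["text"], item["graph"], weights)
print(fused.shape)  # (32,)
```

In a setup like the one the summary describes, a fused vector of this kind would stand in for an item inside the LLM's input sequence, next to the natural language prompt that describes the user's behavior history. How the fused vector is mapped to the LLM's token embedding size is not covered by this sketch.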