Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations

Luyi Ma, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sushant Kumar, Kannan Achan

2024-10-16

Abstract: Integrating diverse data modalities is crucial for enhancing the performance of personalized recommendation systems. Traditional models, which often rely on a single data source, lack the depth needed to accurately capture the multifaceted nature of item features and user behaviors. This paper introduces a novel framework for multi-behavior recommendation that fuses three modalities (visual, textual, and graph data) by aligning them with large language models (LLMs). Visual information captures contextual and aesthetic item characteristics; textual data provides detailed insight into user interests and item features; and graph data elucidates relationships within the item-behavior heterogeneous graphs. Our proposed model, Triple Modality Fusion (TMF), uses an LLM to align and integrate these three modalities into a comprehensive representation of user behaviors. The LLM models the user's interactions, including behaviors and item features, in natural language. The LLM is first warmed up using only natural-language prompts. We then devise a modality fusion module, based on cross-attention and self-attention mechanisms, that maps the modality embeddings produced by other models into the same embedding space and incorporates them into the LLM. Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy. Further ablation studies validate our model design and the benefits of TMF.
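As a rough illustration of what "modeling interactions in natural language" could look like, the sketch below renders a multi-behavior history as a plain-text prompt. The `Interaction` class, the `build_prompt` function, the behavior labels, and the prompt wording are all hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Interaction:
    """One step of a user's history (hypothetical structure)."""
    behavior: str    # e.g. "view", "add_to_cart", "purchase" (illustrative labels)
    item_title: str  # short textual description of the item


def build_prompt(history: list[Interaction], target_behavior: str) -> str:
    """Render a multi-behavior history as a plain-text instruction prompt."""
    lines = [
        f"{step}. The user performed '{it.behavior}' on: {it.item_title}"
        for step, it in enumerate(history, start=1)
    ]
    return (
        "Here is a user's interaction history, oldest first:\n"
        + "\n".join(lines)
        + f"\nPredict the next item the user will '{target_behavior}'."
    )


if __name__ == "__main__":
    history = [
        Interaction("view", "red running shoes"),
        Interaction("add_to_cart", "sports socks"),
    ]
    print(build_prompt(history, "purchase"))
```

A text-only prompt of this kind is the sort of input the warm-up stage would use, before any image or graph embeddings are involved.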

Information Retrieval, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the inability of single-source models in personalized recommendation systems to capture the multi-dimensional nature of user behavior and product features. Traditional recommendation models typically rely on a single data source, which limits how well they can represent the complexity of user behavior and product characteristics. To overcome this limitation, the paper proposes a new framework, Triple Modality Fusion (TMF), which aligns visual, textual, and graph data with large language models (LLMs) for multi-behavior recommendation. Specifically, the paper makes the following contributions:

1. **Multimodal Data Fusion**: Visual, textual, and graph data are integrated to capture, respectively, the contextual and aesthetic features of products, user interests and detailed product features, and the relationships between products and behaviors.
2. **LLM-based Recommendation Model**: User behavior sequences are converted into natural-language input prompts so that an LLM can model and predict user behavior.
3. **Modality Fusion Module**: A module based on cross-attention and self-attention mechanisms projects data from the different modalities into the same embedding space so they can be integrated into the LLM.
4. **Experimental Validation**: Extensive experiments on three benchmark datasets validate the effectiveness of TMF in improving recommendation accuracy, and the model is reported to have been deployed in a production environment.

In summary, by fusing the three modalities through an LLM, the paper improves the accuracy of multi-behavior recommendation and offers a new approach for building personalized recommendation systems.
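The fusion idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's architecture: it assumes each modality has already been embedded to a common dimension, omits the learned projection matrices and multiple heads that a real attention layer would have, and chooses (as an assumption) the text embedding as the cross-attention query. The names `attention` and `fuse` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention(queries: List[Vector], keys: List[Vector],
              values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def fuse(text: Vector, image: Vector, graph: Vector) -> Vector:
    """Fuse three same-sized modality embeddings into one vector."""
    # Cross-attention: the text embedding (assumed query) attends to
    # the image and graph embeddings.
    context = attention([text], [image, graph], [image, graph])[0]
    # Self-attention: every vector attends to every other vector.
    tokens = [text, image, graph, context]
    mixed = attention(tokens, tokens, tokens)
    # Mean-pool into a single embedding that could be passed to an LLM
    # as one input token embedding.
    return [sum(t[d] for t in mixed) / len(mixed) for d in range(len(text))]


if __name__ == "__main__":
    fused = fuse([0.2, 0.1, 0.4], [0.9, 0.0, 0.3], [0.1, 0.8, 0.5])
    print(len(fused))  # same dimension as the inputs
```

Mean pooling at the end is only one way to reduce the attended vectors to a single embedding; keeping all of them as separate tokens would be an equally plausible design.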


MMREC: LLM Based Multi-Modal Recommender System

Collaborative Cross-modal Fusion with Large Language Model for Recommendation

Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation

Attention-guided Multi-step Fusion: A Hierarchical Fusion Network for Multimodal Recommendation

Multi-modal Recommendation Based on Knowledge Graph

NoteLLM-2: Multimodal Large Representation Models for Recommendation

Latent Structure Mining With Contrastive Modality Fusion for Multimedia Recommendation

Personalized Recommendation Systems Powered By Large Language Models: Integrating Semantic Understanding and User Preferences

Aligning Large Language Models with Recommendation Knowledge

End-to-end training of Multimodal Model and ranking Model

Integrating Large Language Models into Recommendation via Mutual Augmentation and Adaptive Aggregation

Collaborative Knowledge Fusion: A Novel Approach for Multi-task Recommender Systems via LLMs

Multi-modal recommendation algorithm fusing visual and textual features

Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation

Where can FDG-PET contribute most to anatomical imaging problems?

Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond

GUME: Graphs and User Modalities Enhancement for Long-Tail Multimodal Recommendation

Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation
