Abstract: Multimodal recommendation systems integrate diverse multimodal information into the feature representations of both items and users, thereby enabling a more comprehensive modeling of user preferences. However, existing methods are hindered by data sparsity and the inherent noise within multimodal data, which impedes the accurate capture of users' interest preferences. Additionally, discrepancies in the semantic representations of items across different modalities can adversely impact the prediction accuracy of recommendation models. To address these challenges, we introduce a novel diffusion-based contrastive learning framework (DiffCL) for multimodal recommendation. DiffCL employs a diffusion model to generate contrastive views that effectively mitigate the impact of noise during the contrastive learning phase. Furthermore, it improves semantic consistency across modalities by aligning distinct visual and textual semantic information through stable ID embeddings. Finally, the introduction of the Item-Item Graph enhances multimodal feature representations, thereby alleviating the adverse effects of data sparsity on the overall system performance. We conduct extensive experiments on three public datasets, and the results demonstrate the superiority and effectiveness of DiffCL.

What problem does this paper attempt to address?

This paper attempts to solve three key problems in multimodal recommendation systems:

1. **Data Sparsity**: User-item interaction data is often sparse, which makes it difficult for the model to accurately capture users' interest preferences.
2. **Noise in Multimodal Data**: The inherent noise in multimodal data (such as images and text) hinders the model from accurately capturing users' interest preferences.
3. **Differences in Semantic Representations Across Modalities**: The semantic representations of an item differ between modalities (such as visual and textual), which reduces the prediction accuracy of recommendation models.

To solve these problems, the paper proposes DiffCL, a contrastive learning framework based on diffusion models. The specific methods are as follows:

- **Introducing Diffusion Models to Generate Contrastive Views**: Using a diffusion model to generate contrastive views reduces the influence of noise introduced during self-supervised learning. The forward process of the diffusion model gradually adds Gaussian noise, and the reverse process restores the original data through denoising, thereby generating meaningful contrastive views. Each forward step, with noise variance $\beta_t$ at step $t$, is:

  \[ q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\right) \]

- **Enhancing Semantic Consistency**: The semantic information of different modalities is aligned through stable ID embeddings to ensure cross-modal semantic consistency. The stability and uniqueness of ID embeddings ensure that the semantics of an item remain consistent across modal perspectives.

  \[ E_m = \begin{bmatrix} e_u^m \\ e_i^m \end{bmatrix} \]

- **Introducing Item-Item Graphs**: Item feature representations are enhanced by constructing Item-Item graphs, which alleviates the adverse effects of data sparsity on system performance. The Item-Item graph uses the KNN algorithm: it computes the cosine similarity between the modality-$m$ features $f_m^i$ and $f_m^j$ of items $i$ and $j$, and connects each item to its K most similar neighbors.

  \[ S_m^{i,j} = \frac{(f_m^i)^\top f_m^j}{\|f_m^i\|\,\|f_m^j\|} \]

Through these methods, the DiffCL framework handles data sparsity and noise in multimodal recommendation systems more effectively while improving semantic consistency between modalities, thereby enhancing the overall performance of the recommendation system. According to the paper, experiments on three public datasets demonstrate the superiority and effectiveness of DiffCL.
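
The forward noising step and the cosine-similarity KNN graph described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' implementation: the function names (`forward_diffusion_step`, `cosine_similarity`, `knn_item_graph`), the toy feature values, and the choice of `beta_t` are all assumptions made for the example. The paper may also weight or normalize the graph edges, which this sketch does not attempt.

```python
import math
import random


def forward_diffusion_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I) (one forward step)."""
    scale = math.sqrt(1.0 - beta_t)
    std = math.sqrt(beta_t)
    return [scale * v + std * rng.gauss(0.0, 1.0) for v in x_prev]


def cosine_similarity(a, b):
    """S^{i,j} = (a . b) / (||a|| * ||b||); returns 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def knn_item_graph(features, k):
    """Map each item index to the indices of its k most similar other items."""
    graph = {}
    for i, f_i in enumerate(features):
        scored = [
            (cosine_similarity(f_i, f_j), j)
            for j, f_j in enumerate(features)
            if j != i
        ]
        # Highest similarity first; ties broken by the smaller item index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[i] = [j for _, j in scored[:k]]
    return graph


# Toy single-modality item features (made-up values, not from the paper).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = knn_item_graph(features, k=1)
# graph == {0: [1], 1: [0], 2: [3], 3: [2]}

# One noising step applied to the first item's feature vector (hypothetical beta_t).
rng = random.Random(0)
noisy = forward_diffusion_step(features[0], beta_t=0.02, rng=rng)
```

Items 0 and 1 point in nearly the same direction, as do items 2 and 3, so each pair ends up mutually connected. Passing an explicit `random.Random` instance keeps the noising step reproducible. The reverse (denoising) process requires a learned model and is deliberately left out of this sketch.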

DiffCL: A Diffusion-Based Contrastive Learning Framework with Semantic Alignment for Multimodal Recommendations