Abstract: With the development of multimedia systems, multimodal recommendations are playing an essential role, as they can leverage rich contexts beyond interactions. Existing methods mainly regard multimodal information as an auxiliary, using it to help learn ID features. However, there are semantic gaps between multimodal content features and ID-based features, so directly using multimodal information as an auxiliary leads to misalignment in the representations of users and items. In this paper, we first systematically investigate the misalignment issue in multimodal recommendations, and propose a solution named AlignRec. In AlignRec, the recommendation objective is decomposed into three alignments, namely alignment within contents, alignment between content and categorical ID, and alignment between users and items. Each alignment is characterized by a specific objective function and is integrated into our multimodal recommendation framework. To effectively train AlignRec, we propose starting from pre-training the first alignment to obtain unified multimodal features and subsequently training the following two alignments together with these features as input. As it is essential to analyze whether each multimodal feature helps in training and to accelerate the iteration cycle of recommendation models, we design three new classes of metrics to evaluate intermediate performance. Our extensive experiments on three real-world datasets consistently verify the superiority of AlignRec compared to nine baselines. We also find that the multimodal features generated by AlignRec are better than currently used ones, which are to be open-sourced in our repository: https://github.com/sjtulyf123/AlignRec_CIKM24.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily explores the alignment issue in multimodal recommendation systems and proposes a new method called AlignRec. Specifically:

1. **Alignment Issue**: Existing multimodal recommendation methods mainly use image and text information as auxiliary features to help learn ID features. However, there is a semantic gap between these modalities, and directly using multimodal information can lead to inconsistencies in user and item representations.
2. **Solution**: The paper systematically studies the alignment issue in multimodal recommendations and proposes a solution, AlignRec. AlignRec decomposes the recommendation objective into three alignment tasks:
   - Inter-Content Alignment (ICA): Unifying the representation of different modalities through a cross-modal encoder.
   - Content-Category Alignment (CCA): Using contrastive learning to narrow the gap between multimodal content features and user/item ID features.
   - User-Item Alignment (UIA): Aligning users with the items they have interacted with through cosine similarity.
3. **Training Strategy**: To effectively train AlignRec, the authors propose pre-training the inter-content alignment task first, and then using the multimodal features obtained from pre-training for joint training of the subsequent two alignment tasks.
4. **Evaluation Protocols**: The paper also designs three new intermediate evaluation protocols to directly assess the effectiveness of multimodal features, including zero-shot evaluation, item-based collaborative filtering, and masked modality recommendation, to select better multimodal encoders and reduce the complexity of hyperparameter search.

Through the above methods, the paper aims to improve the performance of multimodal recommendation systems, especially for long-tail items or in cold-start scenarios. Experimental results show that AlignRec outperforms nine baseline methods on three real-world datasets.
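To make the content-ID and user-item alignment ideas more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the InfoNCE-style form of the contrastive loss, the `temperature` value, and the `1 - cosine` form of the user-item loss are all assumptions chosen for illustration, and the summary above does not specify the exact objectives used in AlignRec.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def content_id_contrastive_loss(content_embs, id_embs, temperature=0.1):
    """Hypothetical InfoNCE-style loss: item i's content vector should be
    closer to item i's ID vector than to any other ID vector in the batch."""
    total = 0.0
    for i, c in enumerate(content_embs):
        logits = [cosine(c, e) / temperature for e in id_embs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(content_embs)


def user_item_alignment_loss(user_embs, item_embs, interactions):
    """Hypothetical alignment loss: mean (1 - cosine) over observed
    (user, item) interaction pairs."""
    losses = [1.0 - cosine(user_embs[u], item_embs[i]) for u, i in interactions]
    return sum(losses) / len(losses)


# Toy data: two items with 2-d content features and two candidate ID tables.
content = [[1.0, 0.0], [0.0, 1.0]]
ids_aligned = [[2.0, 0.1], [0.1, 2.0]]   # ID vectors point the same way as content
ids_swapped = [[0.1, 2.0], [2.0, 0.1]]   # ID vectors point at the wrong item

aligned_loss = content_id_contrastive_loss(content, ids_aligned)
swapped_loss = content_id_contrastive_loss(content, ids_swapped)
print(aligned_loss < swapped_loss)
```

In this toy setup the contrastive loss is lower when each item's ID vector points in the same direction as its content vector, which is the behaviour a content-ID alignment objective is meant to reward. In a joint training stage, one plausible design would sum the two losses with a weighting hyperparameter, though that weighting is likewise an assumption here.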

AlignRec: Aligning and Training in Multimodal Recommendations

Contrastive Intra- and Inter-Modality Generation for Enhancing Incomplete Multimedia Recommendation

DRepMRec: A Dual Representation Learning Framework for Multimodal Recommendation

End-to-end training of Multimodal Model and ranking Model

Train Once, Deploy Anywhere: Matryoshka Representation Learning for Multimodal Recommendation

Collaborative Semantic Alignment in Recommendation Systems

BiVRec: Bidirectional View-based Multimodal Sequential Recommendation

MMRec: Simplifying Multimodal Recommendation

ControlRec: Bridging the Semantic Gap between Language Model and Personalized Recommendation

Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations

FMMRec: Fairness-aware Multimodal Recommendation

MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation

An Empirical Study of Training ID-Agnostic Multi-modal Sequential Recommenders

Towards Bridging the Cross-modal Semantic Gap for Multi-modal Recommendation

QARM: Quantitative Alignment Multi-Modal Recommendation at Kuaishou

RecExplainer: Aligning Large Language Models for Explaining Recommendation Models

Multimodal Difference Learning for Sequential Recommendation

DaRec: A Disentangled Alignment Framework for Large Language Model and Recommender System

Dual-view multi-modal contrastive learning for graph-based recommender systems

Learning Relation Alignment for Calibrated Cross-modal Retrieval

Multimodal Pretraining, Adaptation, and Generation for Recommendation: A Survey