Abstract: Current multimodal recommendation models have extensively explored the effective utilization of multimodal information; however, their reliance on ID embeddings remains a performance bottleneck. Even with the assistance of multimodal information, optimizing ID embeddings remains challenging for ID-based multimodal recommenders when interaction data is sparse. Furthermore, because each ID embedding is specific to one item, it hinders information exchange among related items, and the storage required for ID embeddings grows with the number of items. To address these limitations, we propose an ID-free MultimOdal TOken Representation scheme named MOTOR that represents each item using learnable multimodal tokens and connects items through shared tokens. Specifically, we first employ product quantization to discretize each item's multimodal features (e.g., images, text) into discrete token IDs. We then interpret the token embeddings corresponding to these token IDs as implicit item features, introducing a new Token Cross Network to capture the implicit interaction patterns among these tokens. The resulting representations can replace the original ID embeddings and transform the original ID-based multimodal recommender into an ID-free system, without introducing any additional loss design. MOTOR reduces the overall space requirements of these models and facilitates information interaction among related items, while also significantly enhancing the model's recommendation capability. Extensive experiments on nine mainstream models demonstrate the significant performance improvement achieved by MOTOR, highlighting its effectiveness in enhancing multimodal recommendation systems.
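The storage argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the item count, embedding size, number of modalities, number of sub-vectors, and codebook size are all hypothetical values chosen for the example, not figures from the paper, and the parameters of the Token Cross Network are ignored.

```python
def id_table_params(num_items: int, emb_dim: int) -> int:
    """Floats stored by a conventional per-item ID embedding table."""
    return num_items * emb_dim


def token_table_params(num_modalities: int, num_subspaces: int,
                       codebook_size: int, emb_dim: int) -> int:
    """Floats stored by shared token embedding tables (one per sub-vector)."""
    return num_modalities * num_subspaces * codebook_size * emb_dim


def token_id_entries(num_items: int, num_modalities: int,
                     num_subspaces: int) -> int:
    """Integer token IDs that must still be kept for every item."""
    return num_items * num_modalities * num_subspaces


# Hypothetical setting: 1M items, 64-dim embeddings, 2 modalities,
# 4 sub-vectors per modality, 256 codes per sub-vector.
print(id_table_params(1_000_000, 64))        # 64000000 floats
print(token_table_params(2, 4, 256, 64))     # 131072 floats
print(token_id_entries(1_000_000, 2, 4))     # 8000000 small integers
```

Under these assumed numbers the learnable table no longer grows with the item count; what still scales with the catalogue is the list of token IDs, which are small integers rather than float vectors.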

What problem does this paper attempt to address?

This paper attempts to solve several key problems in current multimodal recommendation systems:

1. **Information silos**: The independent ID embeddings of each item hinder the information exchange between related items.
2. **Cold-start problem**: For new items with very little interaction data, their ID embeddings are difficult to optimize.
3. **Storage burden**: As the number of items increases, the storage requirements for ID embeddings also increase accordingly.

To solve these problems, the authors propose an ID-free multimodal token representation scheme (MOTOR). MOTOR is implemented through the following steps:

- **Feature discretization**: First, use the Optimized Product Quantization (OPQ) technique to discretize the multimodal features (such as images, text) of each item into discrete token IDs.
- **Token embedding**: Then interpret the token embeddings corresponding to these token IDs as implicit item features, and introduce a new Token Cross Network to capture the implicit interaction patterns between these tokens.
- **Replace ID embeddings**: The resulting representation can replace the original ID embeddings, converting the ID-based multimodal recommendation system into an ID-free system without introducing any additional loss design.

The main contributions of MOTOR include:

- **Innovative ID-free multimodal token representation**: According to the authors, this is the first time that quantization techniques have been applied to multimodal recommendation systems to learn item representations through learnable multimodal token crossing.
- **Lightweight Token Cross Network**: A lightweight network is designed to explore the interactions between tokens, and the performance of the Token Cross Network is experimentally evaluated in both modality-specific and cross-modality settings.
- **Significant performance improvement**: Extensive experiments on nine mainstream models show that MOTOR can significantly improve the performance of the recommendation system on both long-tail and popular items, and is compatible with multiple multimodal recommendation models without introducing additional loss design.

Through these methods, MOTOR not only reduces the overall space requirements of the model, but also promotes the information exchange between related items, significantly enhancing the model's recommendation ability.
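The three steps described above (discretize features, look up token embeddings, combine them into an item representation) can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: it uses plain product quantization with a hand-written k-means instead of OPQ (no learned rotation), handles a single modality, initializes the token table randomly where a real model would train it, and stands in for the Token Cross Network with a simple factorization-machine-style pairwise interaction. All function names are hypothetical.

```python
import numpy as np


def train_codebooks(features, num_subspaces, codebook_size, iters=10, seed=0):
    """Plain product quantization: run k-means separately on each sub-vector."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    assert d % num_subspaces == 0, "feature dim must split evenly"
    sub_dim = d // num_subspaces
    codebooks = []
    for m in range(num_subspaces):
        sub = features[:, m * sub_dim:(m + 1) * sub_dim]
        # Initialize centroids from randomly chosen items.
        centroids = sub[rng.choice(n, codebook_size, replace=False)].copy()
        for _ in range(iters):
            dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(axis=1)
            for k in range(codebook_size):
                members = sub[assign == k]
                if len(members) > 0:  # keep old centroid if cluster is empty
                    centroids[k] = members.mean(axis=0)
        codebooks.append(centroids)
    return np.stack(codebooks)  # shape: (num_subspaces, codebook_size, sub_dim)


def quantize(features, codebooks):
    """Map each item's feature vector to one token ID per sub-vector."""
    num_subspaces, _, sub_dim = codebooks.shape
    token_ids = np.empty((features.shape[0], num_subspaces), dtype=np.int64)
    for m in range(num_subspaces):
        sub = features[:, m * sub_dim:(m + 1) * sub_dim]
        dists = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
        token_ids[:, m] = dists.argmin(axis=1)
    return token_ids


def item_representation(token_ids, token_table):
    """Look up token embeddings and combine them.

    Stand-in for the Token Cross Network: mean of the token embeddings plus
    the sum of all pairwise element-wise products (FM-style identity).
    """
    num_subspaces = token_ids.shape[1]
    tokens = token_table[np.arange(num_subspaces), token_ids]  # (N, M, E)
    first_order = tokens.mean(axis=1)
    cross = 0.5 * (tokens.sum(axis=1) ** 2 - (tokens ** 2).sum(axis=1))
    return first_order + cross


# Usage with synthetic data standing in for image or text features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))
codebooks = train_codebooks(feats, num_subspaces=4, codebook_size=8)
ids = quantize(feats, codebooks)
table = rng.normal(scale=0.1, size=(4, 8, 32))  # would be learned in practice
item_emb = item_representation(ids, table)
print(ids.shape, item_emb.shape)
```

Items whose features fall into the same clusters share token IDs and therefore share embedding rows, which is the mechanism the summary credits for information exchange between related items.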

Learning ID-free Item Representation with Token Crossing for Multimodal Recommendation

Contrastive Intra- and Inter-Modality Generation for Enhancing Incomplete Multimedia Recommendation

ID Embedding as Subtle Features of Content and Structure for Multimodal Recommendation

Interest-Related Item Similarity Model Based on Multimodal Data for Top-N Recommendation

Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation

Train Once, Deploy Anywhere: Matryoshka Representation Learning for Multimodal Recommendation

MIC: Model-agnostic Integrated Cross-channel Recommender

Enhancing Dyadic Relations with Homogeneous Graphs for Multimodal Recommendation

Personalized Item Representations in Federated Multimodal Recommendation

GUME: Graphs and User Modalities Enhancement for Long-Tail Multimodal Recommendation

Beyond Co-occurrence: Multi-modal Session-based Recommendation

MMGRec: Multimodal Generative Recommendation with Transformer Model

Multimodal Interactive Network for Sequential Recommendation

Disentangling ID and Modality Effects for Session-based Recommendation

Multimodal Sparse Linear Integration for Content-Based Item Recommendation

End-to-end training of Multimodal Model and ranking Model

Attention-guided Multi-step Fusion: A Hierarchical Fusion Network for Multimodal Recommendation

Enhancing Product Representation with Multi-form Interactions for Multimodal Conversational Recommendation

Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

How to Learn Item Representation for Cold-Start Multimedia Recommendation?

MM-FRec: Multi-Modal Enhanced Fashion Item Recommendation